CONCEPT DIABETES: Analytical pipeline

Author

Ignacio Oscoz Villanueva

Introduction

CONCEPT-DIABETES

CONCEPT-DIABETES is part of CONCEPT, a coordinated research project initiative under the umbrella of REDISSEC, the Spanish Research Network on Health Services Research and Chronic Conditions (www.redissec.com), which aims to analyze the effectiveness and efficiency of chronic care in a number of cohorts built on real-world data (RWD). In the specific case of CONCEPT-DIABETES, the focus is on assessing the effectiveness of a set of clinical practice actions and quality improvement strategies at different levels (patient, healthcare provider and health system) on the management and health outcomes of patients with type 2 diabetes (T2D), using process mining methodology.

It is a population-based retrospective observational study covering all T2D patients diagnosed in four Regional Health Services within the Spanish National Health Service. It includes information from all their contacts with the health services, drawn from electronic medical record systems: Primary Care, Specialist Care, Hospitalizations, Urgent Care and Pharmacy Claims data, as well as other registers such as the mortality and population registers. We will assess to what extent interventions recommended by evidence-based guidelines are implemented in real life and what their effects on health outcomes are. Process mining methods will be used to analyze the data, and a comparison with standard methods will also be conducted.

Cohort

The cohort is defined as patients with type 2 diabetes:

  • Inclusion criteria: Patients that, at 2017-01-01 or during the follow-up from 2017-01-01 to 2022-12-31, had an active health card (active TIA, tarjeta sanitaria activa) and at least one of the codes in the inclusion code list (‘T90’ for CIAP-2, ‘250’ for CIE-9CM, or ‘E11’ for CIE-10-ES).

  • Exclusion criteria: Patients with any of the codes in the exclusion code list during their follow-up (‘T89’ for CIAP-2; ‘250.01’, ‘250.03’, ‘250.11’, ‘250.13’, ‘250.21’, ‘250.23’, ‘250.31’, ‘250.33’, ‘250.41’, ‘250.43’, ‘250.51’, ‘250.53’, ‘250.61’, ‘250.63’, ‘250.71’, ‘250.73’, ‘250.81’, ‘250.83’, ‘250.91’ or ‘250.93’ for CIE-9CM; ‘E10’ for CIE-10-ES) were excluded, as were patients with no contact with the health system from 2017-01-01 to 2022-12-31.

  • Study period: 2017-01-01 until 2022-12-31.
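As an illustration only, the inclusion and exclusion code lists above can be applied as a small filter over a table of diagnosis records. This is a minimal sketch assuming a hypothetical pandas DataFrame with patient_id, codif (coding system) and code columns; it is not the project's actual extraction query, and it omits the active-health-card and health-system-contact requirements.

Code

```python
import pandas as pd

# Inclusion and exclusion code lists from the cohort definition
INCLUSION = {"CIAP-2": {"T90"}, "CIE-9CM": {"250"}, "CIE-10-ES": {"E11"}}
EXCLUSION = {
    "CIAP-2": {"T89"},
    # the CIE-9CM exclusion codes are 250.x1 and 250.x3 for x = 0..9
    "CIE-9CM": {f"250.{x}{y}" for x in range(10) for y in (1, 3)},
    "CIE-10-ES": {"E10"},
}

def select_cohort(dx: pd.DataFrame) -> set:
    """Return patient_ids with at least one inclusion code and no exclusion code.

    `dx` is assumed to hold one diagnosis per row with columns
    patient_id, codif (coding system) and code.
    """
    def has_code(table):
        mask = dx.apply(lambda r: r["code"] in table.get(r["codif"], set()), axis=1)
        return set(dx.loc[mask, "patient_id"])

    return has_code(INCLUSION) - has_code(EXCLUSION)
```

A patient is kept only if an inclusion code appears in any coding system and no exclusion code appears in any of them.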

Treatment guidelines

One of the main intermediate outcome indicators to which clinical guidelines pay special attention is good glycaemic control, since its absence is clearly related to micro- and macrovascular complications. In clinical practice, suboptimal glycaemic control can mainly be attributed to two reasons: patients’ non-adherence to the prescribed treatment, and healthcare providers’ non-adherence to clinical or therapeutic guidelines.

Treatment decisions are based on glycated hemoglobin measurements. In this context, the redGDPS foundation provides the DM2 treatment algorithm, a diagram that aims to help professionals quickly and easily choose the most appropriate treatment for people with diabetes.


Process indicators

Other measures to which different studies pay special attention are process indicators. Process indicators are essential for assessing three dimensions of healthcare quality: effectiveness, safety, and patient-centeredness. They are relatively easy to measure and often do not require risk adjustment, which makes their interpretation straightforward. Poor performance on a process indicator can be directly attributed to the actions of healthcare providers, giving a clear direction for improvement, such as better adherence to clinical guidelines. Moreover, process indicators allow areas for quality improvement to be identified, which is crucial in the complex healthcare environment. These indicators cover a wide range of health factors and help focus resources and efforts on improving the health and well-being of the population. The attention given to process indicators is therefore justified by their crucial role in evaluating and improving healthcare quality and population health.

Running code

The Python libraries used are:

Code
import sys
import subprocess
import pm4py
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import textdistance
import gensim.corpora as corpora
from tqdm import trange, tqdm
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering 
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.cluster import KElbowVisualizer
from pm4py.objects.petri_net.obj import PetriNet, Marking
from pm4py.visualization.decisiontree import visualizer as tree_visualizer
from pm4py.algo.decision_mining import algorithm as decision_mining
from pm4py.visualization.petri_net import visualizer
from gensim.models import LdaModel
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import duckdb
import re
import logging

The R libraries used are:

Code
library(tidyverse)
library(lubridate)
library(jsonlite)
library(ggplot2)
library(bupaR)
library(processmapR)
library(dplyr)
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(here)
library(survival)
library(lcmm)
library(coxme)
library(muhaz)
library(ggfortify)
library(bayestestR)
library(purrr)
library(duckdb)
library(logger)
library(finalfit)
library(flextable)
library(knitr)

Data preprocessing

First of all, it is important to prepare the data correctly for the analysis. The following function creates some SQL views that will be useful later:

Code
def general_preprocess(con):
    '''
    Generate the SQL views used later in the pipeline.
    Args:
        con : db connector variable
    '''
    con.sql("DROP VIEW IF EXISTS cmbd_incidents_view;")
    con.sql("DROP VIEW IF EXISTS comorb_incidents_view;")    
    con.sql("DROP VIEW IF EXISTS ss_use_incidents_view;")
    con.sql("DROP VIEW IF EXISTS param_incidents_view_;")
    con.sql("DROP VIEW IF EXISTS param_incidents_view;")
    con.sql("DROP VIEW IF EXISTS param_cat_incidents_view;")    
    con.sql("DROP VIEW IF EXISTS patient_incidents_view;")
    con.sql("DROP VIEW IF EXISTS treat_incidents_view;")
    con.sql("DROP VIEW IF EXISTS exams_incidents_view;")

    con.sql("CREATE VIEW cmbd_incidents_view AS SELECT * FROM main.cmbd \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE VIEW ss_use_incidents_view AS SELECT * FROM main.ss_use \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE VIEW comorb_incidents_view AS SELECT * FROM main.comorb \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE VIEW param_incidents_view_ AS SELECT DISTINCT * FROM main.param \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01') \
                AND (param_value!=0 OR param_name!='bmi')")
    con.sql("CREATE VIEW param_incidents_view AS \
            SELECT DISTINCT patient_id,param_name,MAX(param_value) AS param_value,param_date\
        FROM param_incidents_view_ GROUP BY patient_id,param_name,param_date")
    con.sql("CREATE VIEW param_cat_incidents_view AS SELECT DISTINCT * FROM main.param_cat \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE VIEW patient_incidents_view AS SELECT *,\
            FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25) \
                AS 'age', \
            FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25) \
                AS 'age_dx', \
            DATE_ADD(month_nac, INTERVAL 75 YEAR) AS 'turn_to_75', \
            DATE_ADD(month_nac, INTERVAL 65 YEAR) AS 'turn_to_65' \
                FROM main.patient WHERE patient_id IN (SELECT patient_id \
                        FROM main.patient WHERE dx_date >='2017-01-01')")
    con.sql("CREATE VIEW treat_incidents_view AS SELECT * FROM main.treat \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE VIEW exams_incidents_view AS SELECT * FROM main.exams \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
                
    con.sql("DROP VIEW IF EXISTS param_incidents_view_age_date;")    
    con.sql("CREATE VIEW param_incidents_view_age_date AS SELECT param_incidents_view.patient_id, \
            param_name, param_value, param_date, turn_to_65, turn_to_75 FROM \
                param_incidents_view LEFT JOIN patient_incidents_view ON \
         param_incidents_view.patient_id = patient_incidents_view.patient_id")

    con.sql("DROP VIEW IF EXISTS cmbd_incidents_postdx_view")
    con.sql("CREATE VIEW cmbd_incidents_postdx_view AS \
            SELECT * FROM cmbd_incidents_view \
                WHERE admission_date > (SELECT dx_date \
                                        FROM patient_incidents_view \
    WHERE patient_incidents_view.patient_id = cmbd_incidents_view.patient_id)")
                    
    con.sql("DROP VIEW IF EXISTS cmbd_incidents_postdx_first_view")
    con.sql("CREATE VIEW cmbd_incidents_postdx_first_view AS \
            SELECT patient_id, admission_date, diagnosis1_code FROM (SELECT *,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id \
                                        ORDER BY admission_date) AS rn \
                          FROM cmbd_incidents_postdx_view) t WHERE rn = 1;")
                
    con.sql("DROP VIEW IF EXISTS param_incidents_predx_view;")
    con.sql("CREATE VIEW param_incidents_predx_view AS SELECT * FROM (\
            SELECT * FROM (SELECT patient_id, param_name, param_value, param_date \
                            FROM param_incidents_view) sub \
                WHERE param_date <= (SELECT dx_date FROM patient_incidents_view \
                    WHERE patient_incidents_view.patient_id = sub.patient_id));")
        
    con.sql("DROP VIEW IF EXISTS param_incidents_predx_last_view;")        
    con.sql("CREATE VIEW param_incidents_predx_last_view AS \
            SELECT patient_id, param_name, param_date, param_value \
                FROM (SELECT patient_id, param_name, param_date, param_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
                                        ORDER BY param_date DESC) AS rn \
                          FROM param_incidents_predx_view \
                          WHERE param_value!=9999) t WHERE rn = 1;")
                    
    con.sql("DROP VIEW IF EXISTS param_cat_incidents_predx_view;")                    
    con.sql("CREATE VIEW param_cat_incidents_predx_view AS SELECT * FROM (\
            SELECT * FROM (SELECT patient_id, param_cat_name, param_cat_value, param_cat_date \
                            FROM param_cat_incidents_view) sub \
                WHERE param_cat_date <= (SELECT dx_date FROM patient_incidents_view \
                    WHERE patient_incidents_view.patient_id = sub.patient_id));")
    con.sql("DROP VIEW IF EXISTS param_cat_incidents_predx_last_view;")
    con.sql("CREATE VIEW param_cat_incidents_predx_last_view AS \
            SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
                FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
                                        ORDER BY param_cat_date DESC) AS rn \
                          FROM param_cat_incidents_predx_view) t WHERE rn = 1;") 
                    
    con.sql("DROP VIEW IF EXISTS param_prevalents_pre_view;")
    con.sql("CREATE VIEW param_prevalents_pre_view AS \
            SELECT DISTINCT patient_id, param_name, param_value, param_date FROM main.param \
             WHERE param_date < '2017-01-01' AND patient_id IN (\
                SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
    con.sql("DROP VIEW IF EXISTS param_prevalents_pre_last_view;")
    con.sql("CREATE VIEW param_prevalents_pre_last_view AS \
            SELECT patient_id, param_name, param_date, param_value \
                FROM (SELECT patient_id, param_name, param_date, param_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
                                        ORDER BY param_date DESC) AS rn \
                          FROM param_prevalents_pre_view \
                              WHERE param_value!=9999) t WHERE rn = 1;")
                    
    con.sql("DROP VIEW IF EXISTS param_cat_prevalents_pre_view;")
    con.sql("CREATE VIEW param_cat_prevalents_pre_view AS \
            SELECT DISTINCT * FROM main.param_cat \
             WHERE param_cat_date < '2017-01-01' AND patient_id IN (\
                SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
    con.sql("DROP VIEW IF EXISTS param_cat_prevalents_pre_last_view;")
    con.sql("CREATE VIEW param_cat_prevalents_pre_last_view AS \
            SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
                FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
                                        ORDER BY param_cat_date DESC) AS rn \
                          FROM param_cat_prevalents_pre_view) t WHERE rn = 1;") 

Treatment event log creation

To use process mining, an event log is needed. The functions below take the data model’s treatment and parameter tables and create an event log of glycated hemoglobin measurements and patient treatments. Glycated hemoglobin measurement events are divided into two different states: those with a value smaller than a limit “L”, and the rest. The value of L depends on the patient’s diabetes duration, age, and comorbidities and complications. Therefore, with the help of expert clinicians, we decided that L would take the following values:

  • L=7 if age<75 and no cardiovascular disease, heart failure, chronic kidney disease or frailty.
  • L=8 if age<65 and any of cardiovascular disease, heart failure, chronic kidney disease or frailty.
  • L=8.5 otherwise.
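As a sketch, the three rules can be collected in a small helper function. The name hba1c_target is hypothetical, introduced here only for illustration; the pipeline itself encodes the thresholds directly in the SQL queries below.

Code

```python
def hba1c_target(age: float, has_comorbidity: bool) -> float:
    """Glycated hemoglobin limit L (in %) by age and comorbidity status.

    has_comorbidity covers cardiovascular disease, heart failure,
    chronic kidney disease and frailty, as in the rules above.
    """
    if not has_comorbidity and age < 75:
        return 7.0
    if has_comorbidity and age < 65:
        return 8.0
    return 8.5
```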

Since they are point measurements, these events have no duration. Treatments, on the other hand, do span a period of time. For treatments, events are defined from drug prescription dates when these exist, and otherwise from dispensing dates plus a fixed period of time. The functions below build event logs with these considerations in mind. The treatment analysis in this document is intended to be done by the predominant clinical condition of the patients according to the DM2 treatment algorithm. In other words, patients are grouped by their predominant clinical condition and the analysis is carried out independently for each group.
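The prescription-versus-dispensing rule can be sketched as follows. This is a simplified illustration: the 30-day coverage window and the treatment_interval helper are assumptions made for the example, not the project’s exact implementation.

Code

```python
import pandas as pd

FIXED_COVERAGE_DAYS = 30  # assumed coverage per dispensation; hypothetical value

def treatment_interval(row: pd.Series) -> tuple:
    """Return (start, end) for one treatment record.

    Uses prescription dates when present; otherwise falls back to the
    dispensing date plus a fixed coverage period.
    """
    if pd.notna(row.get("prescription_date_ini")):
        start = row["prescription_date_ini"]
        end = row.get("prescription_date_end")
    else:
        start = row["dispensing_date"]
        end = start + pd.Timedelta(days=FIXED_COVERAGE_DAYS)
    return start, end
```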

Code
logging.basicConfig(level=logging.INFO)
def separate_patients_by_condition(con):
    '''
    Filtering data by predominant clinical condition according to redGDPS
    Args:
        con : db connector variable
    Return:
        presc_data_p (float): proportion of non-null prescription start dates
        
    https://www.sciencedirect.com/science/article/pii/S1751991821000176?via%3Dihub#tbl0015
    https://www.sanidad.gob.es/estadEstudios/estadisticas/estadisticas/estMinisterio/SIAP/map_cie9mc_cie10_ciap2.htm
    
     obesity (BMI≥30 kg/m2), age older than 75, established CVD (defined as
     myocardial infarction, ischemic heart disease, cerebrovascular disease,
     or peripheral arterial disease), CKD (defined as eGFR < 60 ml/min/
     1.73 m2 and/or UACR ≥ 30 mg/g) and HF. 
    '''
    cvd_code_list = ['K75','K76','K91','K92']  # CIAP-2 codes for established CVD
    hf = 'K77'
    ckd = 'U99.01'
    ob = "T82"

    to_list = lambda l: "('"+"','".join(l)+"')"     
                    
                    
    presc_data_p = con.sql(f" SELECT (COUNT(prescription_date_ini)) / COUNT(*) \
                           AS p FROM treat_incidents_view").fetchall()[0][0]
    cie9_df = pd.read_csv('./CIE9.csv').rename(columns={'CIE9':'CIE',
                                                        'BDCAP':'CIAP'})
    cie9_df['comorb_codif'] = 'ICD9'
    cie10_df = pd.read_csv('./CIE10.csv').rename(columns={'CIE10':'CIE',
                                                          'BDCAP':'CIAP'})
    cie10_df['comorb_codif'] = 'ICD10'
    cie2ciap = pd.concat([cie9_df[['comorb_codif','CIE','CIAP']],
                          cie10_df[['comorb_codif','CIE','CIAP']]])
    
    con.sql("DROP VIEW IF EXISTS comorb_incidents_ciap_view;")    
    con.sql("DROP VIEW IF EXISTS comorb_incidents_active_view;")    
    con.sql("CREATE VIEW comorb_incidents_ciap_view AS \
            SELECT * FROM comorb_incidents_view \
                LEFT JOIN cie2ciap \
                    ON comorb_incidents_view.comorb_codif = cie2ciap.comorb_codif \
                        AND comorb_incidents_view.comorb_code = cie2ciap.CIAP")
    con.sql("CREATE VIEW comorb_incidents_active_view AS SELECT *, \
            CASE WHEN comorb_codif = 'CIAP2' THEN comorb_code ELSE CIAP END AS CIAP2 \
                FROM comorb_incidents_ciap_view WHERE comorb_date_end IS NULL;")

    # Which is the predominant clinical condition at baseline?
    f_list = set()
    ob_list = set(con.sql(f"SELECT DISTINCT patient_id \
                          FROM comorb_incidents_active_view WHERE CIAP2 = '{ob}'"
                          ).df()['patient_id'])        
    ckd_list = set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM comorb_incidents_active_view WHERE CIAP2 = '{ckd}'"
                           ).df()['patient_id'])        
    hf_list = set(con.sql(f"SELECT DISTINCT patient_id \
                          FROM comorb_incidents_active_view WHERE CIAP2 = '{hf}'"
                          ).df()['patient_id'])        
    cvd_list = set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM comorb_incidents_active_view \
                               WHERE CIAP2 IN {to_list(cvd_code_list)}"
                               ).df()['patient_id'])        

    # Which is the predominant clinical condition at dx_date?
    # ENFERMEDADES ISQUÉMICAS CARDIACAS (I20-I25)
    # ENFERMEDADES CEREBROVASCULARES (I60-I69)
    # ATEROESCLEROSIS (ENFERMEDADES VASCULARES PERIFÉRICAS) (I70)
    # OTRAS ENFERMEDADES VASCULARES PERIFÉRICAS (I73)
    # EMBOLIA Y TROMBOSIS ARTERIAL (ENFERMEDADES VASCULARES PERIFÉRICAS) (I74)
    to_cvd_change_prelist = ['I2%i' %i for i in range(6)]+[
                             'I6%i' %i for i in [1,3,4,5,6,7,8,9]]+[
                                'I7%i' %i for i in [0,3,4]]
    to_cvd_change_list = []
    for code in to_cvd_change_prelist:
     for c in cie10_df['CIE']:
      if code in c and c not in ['I51.4','I60','I62','I67.1','I68.2','I67.5']:
                to_cvd_change_list.append(c)
    # INSUFICIENCIA CARDÍACA (I50)
    # ENFERMEDAD CARDÍACA Y RENAL CRÓNICA HIPERTENSIVA (I13)
    to_hf_change_list = ['I09.81', 'I11.0', 'I13.0', 'I13.2', 'I50', 'I50.0',
                         'I50.1', 'I50.9']
    # ENFERMEDAD RENAL CRÓNICA (N18)
    # ENFERMEDAD RENAL CRÓNICA HIPERTENSIVA (I12)
    to_ckd_change_list = ['N18','N18.1','N18.2','N18.3','N18.30','N18.31',
                          'N18.32','N18.4','N18.5','N18.6','N18.9','I12',
                          'I12.0','I12.9','filtglom']
    to_f_change_list = ['f_date']
    to_ob_change_list = ['bmi>=30','E66.01','E66.09','E66.1','E66.2',
                         'E66.8','E66.9']
    # SOBREPESO Y OBESIDAD (E66)
    from_cvd_change_list = ['death']
    from_hf_change_list = from_cvd_change_list+to_cvd_change_list
    from_ckd_change_list = from_hf_change_list+to_hf_change_list
    from_f_change_list = from_ckd_change_list+to_ckd_change_list
    from_ob_change_list = from_f_change_list+to_f_change_list+['bmi<30']
    from_else_change_list = from_ob_change_list[:-1]+to_ob_change_list
    
    relevance_order = pd.DataFrame()
    relevance_order['code'] = from_ob_change_list+to_ob_change_list+['end']
    relevance_order['order'] = range(len(relevance_order))

    
       
    # https://ukkidney.org/health-professionals/information-resources/uk-eckd-guide/ckd-stages
    # Include obesity and kcd detection by parameters values                             
    param_filt = con.sql("SELECT * FROM param_incidents_view \
                         WHERE param_name = 'filtglom' AND param_value<60;"
                         ).df().sort_values(['patient_id','param_date'])
    param_filt['time_diff'] = param_filt.groupby(
        'patient_id')['param_date'].diff()
    cmbd_filt = param_filt[param_filt['time_diff'] < pd.Timedelta(days=90)
                           ].drop_duplicates(subset='patient_id',keep='first')[
                               ['patient_id','param_date','param_name']
                               ].rename(columns={
                                   'param_name':'code',
                                   'param_date':'admission_date'})
    con.sql("DROP VIEW IF EXISTS cmbd_incidents_long_view;")
    con.sql("CREATE VIEW cmbd_incidents_long_view AS SELECT *,\
            COALESCE(t.order, 404) AS relevance FROM\
            (SELECT patient_id, admission_date, diagnosis1_code AS code \
                FROM cmbd_incidents_view WHERE diagnosis1_code IS NOT NULL \
            UNION ALL SELECT patient_id, admission_date, diagnosis2_code AS code \
                FROM cmbd_incidents_view WHERE diagnosis2_code IS NOT NULL \
            UNION ALL SELECT patient_id, admission_date, diagnosis3_code AS code \
                FROM cmbd_incidents_view WHERE diagnosis3_code IS NOT NULL \
            UNION ALL SELECT patient_id, admission_date, code FROM cmbd_filt \
            UNION ALL SELECT patient_id, turn_to_75 AS admission_date, 'f_date' \
                AS code FROM patient_incidents_view WHERE turn_to_75<'2023-01-01'\
            UNION ALL SELECT patient_id, death_date AS admission_date, 'death' \
                AS code FROM patient_incidents_view WHERE death_date IS NOT NULL\
            UNION ALL SELECT patient_id, DATE '2023-01-01' AS admission_date, 'end' \
                AS code FROM patient_incidents_view\
            UNION ALL SELECT patient_id,param_date AS admission_date, \
        CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
            FROM param_incidents_predx_last_view WHERE param_name = 'bmi' \
            UNION ALL SELECT patient_id, param_date AS admission_date, \
        CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
            FROM param_incidents_view sub WHERE param_name = 'bmi' \
                AND  param_date > (SELECT dx_date FROM patient_incidents_view \
                                   WHERE patient_incidents_view.patient_id = sub.patient_id)) v \
                LEFT JOIN relevance_order t ON v.code=t.code;")
    
    f_list.update(set(con.sql("SELECT DISTINCT patient_id \
                              FROM patient_incidents_view WHERE turn_to_75<=dx_date"
                              ).df()['patient_id']))
    con.sql("DROP VIEW IF EXISTS ob_events;")
    con.sql(f"CREATE VIEW ob_events AS SELECT * \
            FROM cmbd_incidents_long_view sub \
                WHERE code IN {to_list(to_ob_change_list)} \
                    AND admission_date <= (SELECT dx_date \
                    FROM patient_incidents_view \
                        WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            )
    ob_list.update(set(con.sql("SELECT DISTINCT patient_id \
                           FROM ob_events").df()['patient_id']))
        
    ob_list = ob_list - set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code='bmi<30' \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id) \
                                   AND admission_date > (SELECT MAX(admission_date)\
                        FROM ob_events \
                            WHERE ob_events.patient_id  = sub.patient_id);"
                                   ).df()['patient_id'])
    ckd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code IN {to_list(to_ckd_change_list)} \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            ).df()['patient_id']))        
    hf_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code IN {to_list(to_hf_change_list)} \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            ).df()['patient_id']))        
    cvd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code IN {to_list(to_cvd_change_list)} \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            ).df()['patient_id']))        
      
    
    cvd_list = list(cvd_list)
    hf_list = list(hf_list.difference(cvd_list))
    ckd_list = list(ckd_list.difference(cvd_list+hf_list))
    f_list = list(f_list.difference(cvd_list+hf_list+ckd_list))
    ob_list = list(ob_list.difference(cvd_list+hf_list+ckd_list+f_list))
    else_list = list(set(con.sql("SELECT DISTINCT patient_id \
                                 FROM patient_incidents_view"
                                 ).df()['patient_id']
                         ).difference(cvd_list+hf_list+ckd_list+f_list+ob_list))
               
    # When is a change on predominant clinical condition made?  
    con.sql("DROP VIEW IF EXISTS cmbd_incidents_long_view_dx")
    con.sql(f"CREATE VIEW cmbd_incidents_long_view_dx AS SELECT * \
            FROM cmbd_incidents_long_view sub \
                WHERE admission_date > (SELECT dx_date \
                    FROM patient_incidents_view \
                        WHERE patient_incidents_view.patient_id = sub.patient_id);")

    query_function =  lambda cond: "(WITH rankedevents_*** AS (\
        SELECT patient_id,code,admission_date,ROW_NUMBER() \
          OVER(PARTITION BY patient_id ORDER BY admission_date,relevance) AS event_rank \
              FROM cmbd_incidents_long_view_dx WHERE  patient_id IN {to_list(***_list)} \
                  AND code IN {to_list(from_***_change_list)} ) \
            SELECT patient_id, '***' AS type,code, admission_date AS date \
                FROM rankedevents_*** WHERE  event_rank=1 AND date<'2023-01-01') \
                                    UNION ALL (WITH rankedevents__*** AS (\
        SELECT patient_id,code,admission_date,ROW_NUMBER() \
          OVER(PARTITION BY patient_id ORDER BY admission_date) AS event_rank \
              FROM cmbd_incidents_long_view_dx WHERE  patient_id IN {to_list(***_list)} \
                  AND code IN {to_list(['end']+from_***_change_list)} ) \
            SELECT patient_id, '***' AS type,code, admission_date AS date \
                FROM rankedevents__*** WHERE  event_rank=1 AND code='end')\
                ".replace('***',cond)
    con.sql("DROP TABLE IF EXISTS patient_condition;")    
    con.sql(eval('f"CREATE TABLE patient_condition AS {}"'.format(' UNION ALL '.join(
        [query_function(cond) for cond in ['cvd','hf','ckd','f','ob','else']]))))    
    return presc_data_p
                

def evlog_creation_by_prescriptions(con,
                                 cond,
                                 code2drug_info_path='./diabetes_drugs.csv'):
    '''Preprocessing and event log creation using prescription data
  
    Args:
      con : db connector variable
      cond (str): predominant clinical condition code
      code2drug_info_path (str): path to the table mapping drug codes to drug info
  
    ADNI: ANTIDIABETICOS NO INSULINICOS
  
    The treatment of type 2 diabetes mellitus with ADNI includes a wide range
    of drugs which, according to their mechanisms of action, can be grouped as
    follows:
     • Increased endogenous insulin sensitivity:
        o Biguanides: metformin (MET).
        o Thiazolidinediones: pioglitazone (PIO).
     • Increased endogenous insulin secretion/release:
        o Sulfonylureas (SU).
        o Meglitinides: repaglinide (REP).
     • Reduction of digestive glucose absorption:
        o Alpha-glucosidase inhibitors.
        o Vegetable fiber and derivatives.
     • Incretin effect enhancers:
        o Inhibitors of DPP-4 (iDPP-4).
        o GLP-1 analogues (aGLP-1).
     • Inhibitors of renal glucose reuptake:
        o SGLT-2 inhibitors (iSGLT-2).

    '''
    con.sql(f"DROP TABLE IF EXISTS evlog_raw_{cond}")
    fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
                       WHERE type = '{cond}'").df()
    dx_date = con.sql(f"SELECT patient_id, dx_date FROM  patient_incidents_view \
                      WHERE patient_id IN (SELECT patient_id \
                         FROM patient_condition \
                             WHERE patient_condition.type = '{cond}')").df()
    dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
    fin_date = dict(zip(fin_date.patient_id,fin_date.date))
    code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
    code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
                                           ).get('class','NONE2'
                                                 ).replace('+','&')
    treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
                       WHERE patient_id IN (SELECT patient_id \
                                            FROM patient_condition \
                                                WHERE patient_condition.type = '{cond}') \
                           AND substring(atc_code,1,3) = 'A10'").df()
    treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
    treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])
                        ].drop_duplicates(subset=['patient_id',
                                                  'prescription_date_ini',
                                                  'prescription_date_end',
                                                  'Event'])
    treat_df['actins'] = range(len(treat_df))
    treat_df_start = treat_df[['patient_id','prescription_date_ini',
                              'Event','actins']].rename(columns = 
                                       {'prescription_date_ini':'date'})
    treat_df_end = treat_df[['patient_id','prescription_date_end',
                              'Event','actins']].rename(columns = 
                                       {'prescription_date_end':'date'})
    treat_df_start['cycle'] = 'start'
    treat_df_end['cycle'] = 'end'
    treat_df_end['date'] = treat_df_end['date'].fillna(
        datetime.strptime('2022-12-31', "%Y-%m-%d")).apply(
        lambda x: min([x,datetime.strptime('2023-01-01', "%Y-%m-%d")])).apply(
                                              lambda x: x-timedelta(days=1))
    act_n = max(treat_df['actins'])
    if cond in ['cvd','hf','ckd','f']:
        param_df = con.sql(f"SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {8} THEN 'HBA<L' \
                               WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date < turn_to_65 THEN 'HBA>L' \
                               WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date >= turn_to_65 THEN 'HBA<L' \
                               ELSE 'HBA>L' END AS Event,\
                           ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                    AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL").df()
    else:
        param_df = con.sql(f"SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {7} THEN 'HBA<L' \
                               WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date < turn_to_75 THEN 'HBA>L' \
                               WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date >= turn_to_75 THEN 'HBA<L' \
                               ELSE 'HBA>L' END AS Event,\
                           ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                    AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL").df()        

    param_df_start = param_df.copy()
    param_df_start['cycle'] = 'start'
    param_df_end = param_df.copy()
    param_df_end['cycle'] = 'end'

    df = pd.concat([param_df_start,treat_df_start,param_df_end,treat_df_end]
                   ).drop_duplicates()
    
    df = df.sort_values(by = ['patient_id','date','Event','cycle'],
                        ascending = [True,True,True,False])
    df.index = range(len(df))
    patient_list = list(set(df['patient_id']))
    df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))]

    #####################################################################
    
    event_log = dict()
    event_log['patient_id'] = []
    event_log['date'] = []
    event_log['nid'] = []
    event_log['Event'] = []
    event_log['cycle'] = []
    days_after_m = 5
    id_list = list(set(df['patient_id']))
    events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
    hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
    row = len(events)
    no_hba = [i for i in range(row) if i not in [hba0,hba1]]
    for id in tqdm(id_list):
        df_id = df[df['patient_id']==id]
        nid = id_list.index(id)
        actins = set(df_id['actins'])
        date_min = min([min(set(df_id['date'])),dx_date[id]])
        date_max =  max(set(df_id['date']))
        if str(fin_date[id])!='nan':
            date_max = max([date_max,fin_date[id]])
        else:
            date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
        dd = [date_min + timedelta(days=x) for x in range(
          (date_max-date_min).days + 1)]
        col = len(dd)
        ev_status = np.zeros((row,col))
        for act in actins:
            ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
                                                  df_id['cycle']=='start')])[0]
            ev =  list(df_id['Event'][df_id['actins']==act])[0]            
            fin = list(df_id['date'][np.logical_and(df_id['actins']==act,
                                                  df_id['cycle']=='end')])[0]
            ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
            ev_status = change_matrix_values(ev_status,ev_cols,
                                             list(range(dd.index(ini),
                                                        dd.index(fin)+1)),1)
                            
        measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
                                        ev_status[hba1,:].nonzero()[0])))
        ev_status = ev_status.astype(int)              
        # Delete treatments on measurement days
        ev_status = change_matrix_values(ev_status,no_hba,measures,0)
        # Correction to eliminate events after an HbA1c measurement that are
        # the continuation of the previous treatment rather than a new one.
        # Example: If a patient is on treatment 'A' and, because of an HbA1c
        #          measurement, their treatment changes to 'B', their trace
        #          could originally appear as A>measure>A>B, whereas the true
        #          trace should be A>measure>B. Therefore, within a fixed
        #          period of 'days_after_m' days after each measurement, such
        #          spurious continuations are detected and removed.

        for j in measures:
            if ev_status.shape[1]<=j+1 or days_after_m<3:
                continue
            first_col = ev_status[:,j+1]
            dif_col = np.where(np.any(ev_status[:, j+2:j+days_after_m]!=
                                      first_col[:, np.newaxis],
                                      axis=0))[0]
          
            if  (dif_col.size > 0) and (not
                    np.array_equal(ev_status[:,j-1],
                                    ev_status[:,j+dif_col[0]+2])) and (
                    np.array_equal(ev_status[:,j-1],
                                    ev_status[:,j+1])):
                ev_status = change_matrix_values(ev_status,no_hba,
                                                  range(j+1,j+dif_col[0]+2),0)

          
        col = dd.index(dx_date[id])
        col_0 = dd.index(dx_date[id])
        col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
        min_change_days = 7
        while col<col_max-1:
            if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
                ev = col2treat(ev_status,events,col_0)
                if  ('HBA' not in ev and col-col_0<min_change_days):
                    col+=1
                    col_0 = col
                    continue
                event_log['patient_id'].extend([id,id])
                event_log['date'].extend([dd[col_0],dd[col]])
                event_log['Event'].extend([ev,ev])
                event_log['nid'].extend([nid,nid])
                event_log['cycle'].extend(['start','end'])
                
                col+=1
                col_0 = col
            else:
                col+=1
        ev = col2treat(ev_status,events,col_0)
        if 'HBA' in ev or col-col_0>=min_change_days:
            event_log['patient_id'].extend([id,id])
            event_log['date'].extend([dd[col_0],dd[col]])
            event_log['Event'].extend([ev,ev])
            event_log['nid'].extend([nid,nid])
            event_log['cycle'].extend(['start','end'])
    
            
    evlog = pd.DataFrame.from_dict(event_log)   
    evlog = evlog.sort_values(['patient_id','date'])
    evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
    evlog.index = range(len(evlog))
    evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
    evlog['actins'] = [n//2 for n in range(len(evlog))]
    con.sql(f"CREATE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
 
def evlog_creation_by_dispensations(con, 
                              cond,
                              code2drug_info_path='./diabetes_drugs.csv',
                              nac_path='./Nomenclator_de_Facturacion.csv'):
    '''Preprocessing and event log generation from dispensation data

    Args:
      con : db connector variable
      cond (str): code of the predominant clinical condition
      code2drug_info_path (str): path to the table mapping drug codes to
          drug-class information
      nac_path (str): path to the table with drug container information

    ADNI: ANTIDIABETICOS NO INSULINICOS (non-insulin antidiabetic drugs)

    The treatment of type 2 diabetes mellitus with ADNI includes a wide range
    of drugs which, according to their mechanisms of action, can be grouped
    as follows:
     - Increased endogenous insulin sensitivity:
        o Biguanides: metformin (MET).
        o Thiazolidinediones: pioglitazone (PIO).
     - Increased endogenous insulin secretion/release:
        o Sulfonylureas (SU).
        o Meglitinides: repaglinide (REP).
     - Reduction of digestive glucose absorption:
        o Alpha-glucosidase inhibitors.
        o Vegetable fiber and derivatives.
     - Incretin effect enhancers:
        o Inhibitors of DPP-4 (iDPP-4).
        o GLP-1 analogues (aGLP-1).
     - Inhibitors of renal glucose reuptake:
        o SGLT-2 inhibitors (iSGLT-2).
    '''

    con.sql(f"DROP TABLE IF EXISTS evlog_raw_{cond}")
    treat_min_days = 2
    days_error = 60-treat_min_days
    days_after_m = 7
    fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
                       WHERE type = '{cond}'").df()
    dx_date = con.sql(f"SELECT patient_id, dx_date FROM  patient_incidents_view \
                      WHERE patient_id IN (SELECT patient_id \
                         FROM patient_condition \
                             WHERE patient_condition.type = '{cond}')").df()
    dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
    fin_date = dict(zip(fin_date.patient_id,fin_date.date))
    code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
    code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
                                           ).get('class','NONE2'
                                                 ).replace('+','&')
    treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
                       WHERE patient_id IN (SELECT patient_id \
                                            FROM patient_condition \
                                                WHERE patient_condition.type = '{cond}') \
                           AND substring(atc_code,1,3) = 'A10'").df()
    treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
    treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])]
    nac_dict = nac2comprimidos(nac_path,set(treat_df['nac_code']))
    treat_df['nac_code'] = treat_df['nac_code'].apply(
        lambda x: nac_dict.get(x,days_error))
    treat_df['containers_number'] = treat_df['containers_number'].fillna(1)
    treat_df['nac_code'] = treat_df['nac_code']*treat_df['containers_number']
    treat_df = treat_df[['patient_id','dispensing_date','Event','nac_code']
                        ].rename(columns = {'dispensing_date':'date'}
                                 ).sort_values(by = ['patient_id','date' ],
                      ascending = [True, True])
                     
    if cond in ['cvd','hf','ckd','f']:
        df = con.sql(f"SELECT *, 'start' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {8} THEN 'HBA<L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date < turn_to_65 THEN 'HBA>L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date >= turn_to_65 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                UNION ALL SELECT * FROM treat_df) \
                    UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, param_date,\
                           CASE WHEN param_value < {8} THEN 'HBA<L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date < turn_to_65 THEN 'HBA>L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date >= turn_to_65 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                UNION ALL SELECT patient_id,\
                                CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
                                    DAYS AS date, date AS param_date, \
                                        Event, nac_code FROM treat_df)").df()
    else:
        df = con.sql(f"SELECT *, 'start' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {7} THEN 'HBA<L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date < turn_to_75 THEN 'HBA>L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date >= turn_to_75 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                UNION ALL SELECT * FROM treat_df) \
                    UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, param_date,\
                           CASE WHEN param_value < {7} THEN 'HBA<L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date < turn_to_75 THEN 'HBA>L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date >= turn_to_75 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                UNION ALL SELECT patient_id,\
                                CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
                                    DAYS AS date, date AS param_date, \
                                        Event, nac_code FROM treat_df)").df()

    patient_list = list(set(df['patient_id']))
    df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))] 
    df = df.sort_values(by = ['patient_id', 'actins','cycle'],
                        ascending = [True, True, False])
    df.index = range(len(df))
    
    #####################################################################

    event_log = dict()
    event_log['patient_id'] = []
    event_log['date'] = []
    event_log['nid'] = []
    event_log['Event'] = []
    event_log['cycle'] = []
    
    id_list = list(set(df['patient_id']))
    events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
    hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
    row = len(events)
    no_hba = [i for i in range(row) if i not in [hba0,hba1]]
    for id in tqdm(id_list):
        df_id = df[df['patient_id']==id]
        nid = id_list.index(id)
        actins = set(df_id['actins'])
        date_min = min([min(set(df_id['date'])),dx_date[id]])
        date_max =  max(set(df_id['date']))
        if str(fin_date[id])!='nan':
            date_max = max([date_max,fin_date[id]])
        else:
            date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
        dd = [date_min + timedelta(days=x) for x in range(
          (date_max-date_min).days + 1)]
        col = len(dd)
        ev_status = np.zeros((row,col))
        max_nac = int(max(df_id['nac_code'])*1.2)
        # Event matrix where columns are days and rows are treatments:
        # 0: no treatment, 1: treatment, 2: possible treatment,
        # 3: post-measurement period
        for act in actins:
            ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
                                                  df_id['cycle']=='start')])[0]
            ev =  list(df_id['Event'][df_id['actins']==act])[0]            
            fin = ini+timedelta(days=treat_min_days) if 'HBA' not in ev else ini
            ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
            ev_status = change_matrix_values(ev_status,ev_cols,
                                             list(range(dd.index(ini),
                                                        dd.index(fin)+1)),1)
            if 'HBA' not in ev:
                possible_day_duration = int(list(
                    df_id['nac_code'][np.logical_and(df_id['actins']==act,
                    df_id['cycle']=='start')])[0]*1.2)
                until_day = min([dd.index(ini)+possible_day_duration,col])
                ev_status = change_matrix_values(ev_status,ev_cols,
                                                 list(range(dd.index(fin),
                                                            until_day)),2)
                            
                                        
                
                
        measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
                                        ev_status[hba1,:].nonzero()[0])))
        # Event compaction.
        # Example: If a patient is on treatment 'A' and drug 'A' is dispensed
        #          several times, their trace could appear as A>A>A>A, whereas
        #          the true trace is a single 'A' treatment lasting the
        #          corresponding time. Therefore, if the gap between two
        #          consecutive dispensations of the same drug is shorter than
        #          the number of tablets in the container, the two events are
        #          joined into one, taking the first day of the first event as
        #          the start date and the last day of the last event as the
        #          end date.
        ev_status = ev_status.astype(int)
        for i in no_hba:
            seq = str(list(ev_status[i,:]))
            for delta_days in range(1,max_nac):
                pattern = ', '+str([1]+[2]*delta_days+[1])[1:-1]
                new_pattern = ', '+str([1]+[1]*delta_days+[1])[1:-1]
                seq = seq.replace(pattern,new_pattern)
            ev_status[i,:] = eval(seq) 
              
        ev_status = change_matrix_values(ev_status,no_hba,measures,3)
        for i,j in list(itertools.combinations(no_hba,2)):
            seq_i = str(list(ev_status[i,:]))
            seq_j = str(list(ev_status[j,:]))
            for delta_days in range(1,max_nac):
                pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                if re.search(pattern_i,seq_i)==None:
                    continue
                pattern_j = ', '+str([1]+[2]*delta_days+[3])[1:-1]
                new_pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                seq_j = seq_j.replace(pattern_j,new_pattern_j)
            for delta_days in range(1,max_nac):
                pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                if re.search(pattern_j,seq_j)==None:
                    continue
                pattern_i = ', '+str([1]+[2]*delta_days+[3])[1:-1]
                new_pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                seq_i = seq_i.replace(pattern_i,new_pattern_i)
            ev_status[i,:] = eval(seq_i)
            ev_status[j,:] = eval(seq_j)

        measures_w_dp = [[i+d_p for d_p in range(days_after_m+1)
                          if i+d_p<col] for i in measures]    
        measures_w_dp = list(itertools.chain(*measures_w_dp))
        ev_status = change_matrix_values(ev_status,no_hba,measures_w_dp,3)

        for i in no_hba:
            seq = str(list(ev_status[i,:]))
            seq = seq.replace('2','0').replace('3','0')
            ev_status[i,:] = eval(seq)               
                
        col = dd.index(dx_date[id])
        col_0 = dd.index(dx_date[id])
        col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
        min_change_days = 42
        while col<col_max-1:
            if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
                ev = col2treat(ev_status,events,col_0)
                if  (ev=='_' and col-col_0<min_change_days):
                    col+=1
                    col_0 = col
                    continue
                event_log['patient_id'].extend([id,id])
                event_log['date'].extend([dd[col_0],dd[col]])
                event_log['Event'].extend([ev,ev])
                event_log['nid'].extend([nid,nid])
                event_log['cycle'].extend(['start','end'])
                
                col+=1
                col_0 = col
            else:
                col+=1
        ev = col2treat(ev_status,events,col_0)
        if ev!='_' or col-col_0>=min_change_days:
            event_log['patient_id'].extend([id,id])
            event_log['date'].extend([dd[col_0],dd[col]])
            event_log['Event'].extend([ev,ev])
            event_log['nid'].extend([nid,nid])
            event_log['cycle'].extend(['start','end'])
    
            
    evlog = pd.DataFrame.from_dict(event_log)   
    evlog = evlog.sort_values(['patient_id','date'])
    evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
    evlog.index = range(len(evlog))
    evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
    evlog['actins'] = [n//2 for n in range(len(evlog))]
    con.sql(f"CREATE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
  
def change_matrix_values(matrix,list_rows,list_cols, new_value=0):
    '''Set a given value at the selected rows and columns of a matrix

    Args:
      matrix (array): matrix to modify
      list_rows (list): rows to modify
      list_cols (list): columns to modify
      new_value (int): value to assign

    Output:
      matrix (array): modified matrix
    '''
    pos = list(itertools.product(list_rows,list_cols))
    for i,j in pos:
        matrix[i,j] = new_value
    return matrix
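The explicit loop over itertools.product can, if desired, be replaced by a single vectorized NumPy assignment. This sketch (not part of the pipeline) shows the equivalence using np.ix_, which builds the same row/column cross product:

```python
import numpy as np

# np.ix_ builds the cross product of the row and column index lists,
# so this single assignment touches the same cells as the loop over
# itertools.product in change_matrix_values
m = np.zeros((3, 5), dtype=int)
m[np.ix_([0, 2], [1, 3])] = 1
print(m.tolist())
# [[0, 1, 0, 1, 0], [0, 0, 0, 0, 0], [0, 1, 0, 1, 0]]
```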

def col2treat(matrix,rows,col):
    '''Translate a column of the event matrix into a treatment label

    Args:
      matrix (array): event calendar in binary form (events x days)
      rows (list): list of event labels
      col (int): column of the matrix to translate

    Output:
      treatment (str): '+'-joined active events, or '_' if none
    '''
    treatment = '+'.join(sorted([rows[ev
                            ] for ev in range(len(rows)) 
                                 if matrix[ev,col]!=0]))
    if treatment!='':
        return treatment
    else:
        return '_'
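A quick usage illustration, with the helper inlined so the snippet is self-contained (the event labels and matrix are hypothetical):

```python
import numpy as np

def col2treat(matrix, rows, col):
    # Join the labels of the events active on a given day; '_' if none
    treatment = '+'.join(sorted(rows[ev] for ev in range(len(rows))
                                if matrix[ev, col] != 0))
    return treatment if treatment != '' else '_'

events = ['MET', 'SU', 'HBA<L']   # hypothetical event labels
status = np.array([[1, 0],        # MET active on day 0
                   [1, 0],        # SU active on day 0
                   [0, 0]])       # no HbA1c event
print(col2treat(status, events, 0))  # MET+SU
print(col2treat(status, events, 1))  # _
```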
    
def nac2comprimidos(nac_path,code_set):
    '''Build a dictionary mapping NAC codes to tablets per container

    Args:
      nac_path (str): path to the NAC (Nomenclator) table
      code_set (set): set of NAC codes of interest

    Output:
      nac_dict (dict): NAC codes as keys, number of tablets as values
    '''

    nac = pd.read_csv(nac_path)
    nac['Código Nacional'] = nac['Código Nacional'].astype(str)
    nac_dict = {}

    for n in range(len(nac)):
        if nac['Código Nacional'][n] in code_set:
            text = nac['Nombre del producto farmacéutico'][n].lower()
            if re.search(r'\d+(?=\s*comprim)', text)!=None:
                nac_dict[nac['Código Nacional'][n]
                         ] = int(re.search(r'\d+(?=\s*comprim)', text).group())
            elif re.search(r'\d+(?=\s*comprim)', 
                           re.sub(r'\([^()]*\)', '', text))!=None:
                nac_dict[nac['Código Nacional'][n]
                         ] = int(re.search(r'\d+(?=\s*comprim)', 
                                re.sub(r'\([^()]*\)', '', text)).group())          
            elif re.search(r'con pelicula (\d+)', text)!=None:
                nac_dict[nac['Código Nacional'][n]
                         ] = int(re.search(r'con pelicula (\d+)',
                                    text).group().split()[-1])
    return nac_dict
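The tablet-count extraction can be illustrated on synthetic product names (hypothetical strings, not real Nomenclator entries). The lookahead `\d+(?=\s*comprim)` grabs the number immediately preceding "comprim…":

```python
import re

def tablets(text):
    # Same extraction idea as nac2comprimidos: first try the raw name,
    # then retry with parenthesised fragments removed
    text = text.lower()
    m = re.search(r'\d+(?=\s*comprim)', text)
    if m is None:
        m = re.search(r'\d+(?=\s*comprim)', re.sub(r'\([^()]*\)', '', text))
    return int(m.group()) if m else None

# Hypothetical product names
print(tablets('METFORMINA 850 mg 50 comprimidos'))      # 50
print(tablets('PIOGLITAZONA 15 mg (EFG) 28 comprim.'))  # 28
```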

Process indicators’ event log creation

To study process indicators, a function that builds a process indicators’ event log is coded below. In addition to laboratory measurements and performed examinations, three artificial events have been included to aid the analysis: the ‘INI’ event denotes each patient’s diabetes diagnosis date, ‘yFIN’ events mark the end of each annual interval, and the ‘FIN’ event is the date of the end of the last completed annual interval.
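As a quick sketch of the interval bookkeeping, the ‘yFIN’ boundaries fall at fixed day offsets from diagnosis, matching the INTERVAL arithmetic in the function below (1461 = 4 × 365 + 1 accounts for a leap day):

```python
from datetime import datetime, timedelta

dx = datetime(2017, 3, 10)  # hypothetical diagnosis date
offsets = (365, 730, 1095, 1461, 1826)  # day offsets used for 'yFIN'
yfin = [(dx + timedelta(days=d)).strftime('%Y-%m-%d') for d in offsets]
print(yfin)
```

Note that with fixed day offsets the anniversary can drift by a day around leap years; the offsets above simply reproduce the ones used in the SQL.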

Code
def evlog_process_ind(con):
    '''Generation of the process indicators' event log

    Args:
      con : db connector variable
    '''
    con.sql("DROP VIEW IF EXISTS pre_process_ind;")
    con.sql("DROP TABLE IF EXISTS process_ind;")
    for rn in range(1, 6):
        con.sql(f"DROP TABLE IF EXISTS process_ind{rn};")

    
    con.sql("CREATE VIEW pre_process_ind AS SELECT *,'start' AS cycle,\
                ROW_NUMBER() OVER (ORDER BY patient_id,date) AS actins \
            FROM (SELECT DISTINCT patient_id,'hba1c' AS event, \
            CAST(param_date AS TIMESTAMP) + INTERVAL 60 SECOND AS date \
            FROM param_incidents_view WHERE param_name = 'hba1c' \
            UNION ALL SELECT DISTINCT patient_id,'col' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 120 SECOND AS date \
            FROM param_incidents_view WHERE param_name IN  ('col','hdl','ldl') \
            UNION ALL SELECT DISTINCT patient_id,'blood_p' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 180 SECOND AS date \
            FROM param_incidents_view WHERE param_name IN  ('sbp','dbp') \
            UNION ALL SELECT DISTINCT patient_id,'indalbcr' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 240 SECOND AS date \
            FROM param_incidents_view WHERE param_name='indalbcr' \
            UNION ALL SELECT DISTINCT patient_id,'filtglom' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 300 SECOND AS date \
            FROM param_incidents_view WHERE param_name='filtglom' \
            UNION ALL SELECT DISTINCT patient_id,'bmi' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 360 SECOND AS date \
            FROM param_incidents_view WHERE param_name IN ('bmi','weight') \
            UNION ALL SELECT DISTINCT patient_id,'ocular_exam' AS event, \
                    CAST(test_date AS TIMESTAMP) + INTERVAL 420 SECOND AS date \
            FROM exams_incidents_view WHERE test_name='ocular_exam_PC' \
            UNION ALL SELECT DISTINCT patient_id,'foot_exam' AS event, \
                    CAST(test_date AS TIMESTAMP) + INTERVAL 480 SECOND AS date \
            FROM exams_incidents_view WHERE test_name='foot_exam' \
            UNION ALL SELECT patient_id,'INI' AS event, \
                    CAST(dx_date AS TIMESTAMP) AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 365 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 730 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 1095 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 1461 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 1826 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'FIN' AS event,\
                    CAST(COALESCE(deregistration_date,'2023-01-01') AS TIMESTAMP) AS date \
            FROM patient_incidents_view)") 
    
    
    
    
    con.sql("CREATE TABLE process_ind AS SELECT a.* FROM pre_process_ind a \
            JOIN (SELECT patient_id, MIN(date) as ini_date,MAX(date) as fin_date \
            FROM pre_process_ind d WHERE event IN ('INI', 'yFIN') AND \
                date<(SELECT c.date FROM pre_process_ind c \
                      WHERE c.event = 'FIN' AND c.patient_id = d.patient_id) \
            GROUP BY patient_id) b ON a.patient_id = b.patient_id \
            WHERE (a.date BETWEEN b.ini_date AND b.fin_date) OR a.event = 'FIN'")
    for rn in range(1, 6):
        con.sql(f"CREATE TABLE process_ind{rn} AS SELECT * FROM \
                (WITH RankedActions AS (SELECT patient_id, date, event,\
                        ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
                            FROM process_ind WHERE event = 'yFIN') \
                 SELECT DISTINCT A.* FROM process_ind A \
                     JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
                         R.rn = {rn} AND (A.date <= R.date OR A.event = 'FIN'))")
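As a hedged, self-contained illustration of the ROW_NUMBER windowing used to build these tables (sqlite3 from the standard library stands in for DuckDB, and the patient rows are toy data):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE ev (patient_id INT, event TEXT, d TEXT)")
con.executemany("INSERT INTO ev VALUES (?,?,?)",
                [(1, 'INI',   '2017-01-01'),
                 (1, 'HBA1C', '2017-06-01'),
                 (1, 'yFIN',  '2018-01-01'),
                 (1, 'HBA1C', '2018-06-01'),
                 (1, 'yFIN',  '2019-01-01'),
                 (1, 'FIN',   '2023-01-01')])
counts = {}
for rn in (1, 2):
    # keep the events up to the patient's rn-th 'yFIN', plus the closing 'FIN'
    rows = con.execute(f"""
        WITH RankedActions AS (
            SELECT patient_id, d, ROW_NUMBER() OVER
                (PARTITION BY patient_id ORDER BY d) AS rn
            FROM ev WHERE event = 'yFIN')
        SELECT DISTINCT a.event, a.d FROM ev a
        JOIN RankedActions r ON a.patient_id = r.patient_id
        WHERE r.rn = {rn} AND (a.d <= r.d OR a.event = 'FIN')""").fetchall()
    counts[rn] = len(rows)
print(counts)  # {1: 4, 2: 6}
```

In the same way, each process_ind table truncates every trace at a successive year-end marker while always keeping the closing 'FIN'.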
        

Distances

One of the most important aims of process mining is to show and explain processes. However, the great variety of traces does not allow us to draw clear conclusions directly, and it is often necessary to simplify the data. Another option, applicable before simplifying to avoid an excessive loss of information and to give another perspective to the analysis, is to cluster the traces and analyze them by cluster. To do that we have to quantify the differences between the traces, that is, the distance between them. Several distances can be used for this task: edit distances, term vector similarity, LDA-based distances, embedding-based distances, and so on. Some of them are implemented below as functions that calculate the distance matrix of the traces:

Code
#######   EDIT DISTANCE   ####### 
def calculate_dm_ED(traces,measure_f):
    '''Calculate distance matrix with some edit distance.
    
    Args:
      traces (list): patients' traces
      measure_f: some edit distance function
      
    Returns:
      dm: distance matrix
    '''
    id2word = corpora.Dictionary(traces)

    traces_e = [[id2word.token2id[t[n]] for n in range(len(t))] for t in traces]
    traces_e_str = list(set([str(traces_e[i]
                                 ) for i in range(len(traces_e))]))    
    len_t_r = len(traces_e_str)
    len_t = len(traces_e)
    dm = np.zeros((len_t,len_t), dtype = np.float32)
    same = measure_f(traces_e[0],traces_e[0])
    d_dic = {str(t):dict() for t in traces_e_str}
    for i in range(len_t_r):
        d_dic[traces_e_str[i]][traces_e_str[i]] = same
        for j in range(i+1,len_t_r):
            d_ij = measure_f(eval(traces_e_str[i]),
                             eval(traces_e_str[j]))
            d_dic[traces_e_str[i]][traces_e_str[j]] = d_ij
            d_dic[traces_e_str[j]][traces_e_str[i]] = d_ij

    for i in range(len_t):
        dm[i][i] = same
        for j in range(i+1,len_t):
            t_i = str(traces_e[i])
            t_j = str(traces_e[j])            
            d_ij = d_dic[t_i][t_j]
            dm[i][j] = d_ij
            dm[j][i] = d_ij
    if same == 1:
        dm = 1 - dm  
    
    return dm

#######   TERM VECTOR SIMILARITY   #######      
def calculate_dm_TV(traces):
    '''Calculate distance matrix with term vector similarity.
    
    Args:
      traces (list): list of traces
      
    Returns:
      dm (array): distance matrix
      vectorizer: TfidfVectorizer
      X: traces vectorized with TfidVectorizer
        
    '''
    corpus = [' '.join(t) for t in traces]
    vectorizer = TfidfVectorizer(tokenizer=str.split)
    X = vectorizer.fit_transform(corpus)
    print('calculating dm ...')
    dm = np.asarray(np.matmul(X.todense(),X.todense().T))
    dm = 1 - dm.round(decimals=4)           
    return dm, vectorizer, X

#######   LDA BASED DISTANCE   #######      
def calculate_dm_LDA(traces,T=10):
    '''Calculate distance matrix with LDA model.
    
    Args:
      traces (list): list of traces
      T (int): number of topics of LDA model
    Returns:
      dm (array): distance matrix
      lda_model: LdaModel
      id2word (dict): tokenized events as keys and events by values
        
    '''

    # Create Dictionary
    id2word = corpora.Dictionary(traces)
    
    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in traces]
    # Make LDA model
    lda_model = LdaModel(corpus=corpus,
                         id2word=id2word,
                         num_topics=T,
                         alpha = 1,
                         eta = 'auto',
                         random_state = 123)
    get_c_topic = np.array(
        lda_model.get_document_topics(corpus,minimum_probability = -0.1))
    sigma = np.asarray([[get_c_topic[i][j][1] 
              for j in range(T)] for i in range(len(corpus))])

    sigma2 = np.asarray(np.matmul(sigma,sigma.T))
    len_t = len(traces)
    dm = np.zeros((len_t,len_t), dtype = np.float32)
    
    same = sigma2[0][0]/np.sqrt(sigma2[0][0]*sigma2[0][0])
    for i in trange(len_t):
        dm[i][i] = same
        for j in range(i+1,len_t):
            d_ij = sigma2[i][j]/np.sqrt(sigma2[i][i]*sigma2[j][j])
            dm[i][j] = d_ij
            dm[j][i] = d_ij
    
            

    dm = 1-dm
    return dm, lda_model, id2word
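Any pairwise sequence distance can serve as measure_f for calculate_dm_ED above. As a minimal, self-contained sketch (the event names here are made up), a plain Levenshtein edit distance over tokenized traces:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance between two event sequences
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

t1 = ['MET', 'HBA1C', 'SU']          # hypothetical traces
t2 = ['MET', 'HBA1C', 'DPP4', 'SU']
print(levenshtein(t1, t2))  # 1: a single insertion turns t1 into t2
```

Since this distance grows with trace length, a normalized variant (dividing by the longer trace's length) may be preferable when traces vary widely in size.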

Clustering

Once the distance matrix is obtained, we can proceed with the clustering. Each clustering method has its own characteristics and peculiarities; in particular, we have to keep in mind that we start from a distance matrix rather than a data frame to which the method is applied directly. A hierarchical clustering algorithm seems a good choice in our case because, besides accepting a precomputed distance matrix, it allows us to choose the optimal number of clusters.

The next code box contains two different functions to choose the optimal number of clusters:

Code
def dendogram(dm,output_png='../../outputs/dendogram.png'):
  '''Plot and save dendrogram.
    
    Args:
      dm (array): distance matrix
      output_png (str): path where the dendrogram figure is saved

  '''

  dm_condensed = squareform(dm)
  
  matrix = linkage(
      dm_condensed,
      method='complete'
      )
  sys.setrecursionlimit(10000000)
  dn = dendrogram(matrix,truncate_mode='lastp',p=80)
  sys.setrecursionlimit(1000)
  plt.title('Dendrogram')
  plt.ylabel('Distance')
  plt.xlabel('Patients traces')
  plt.savefig(output_png)
  plt.clf()

def kelbow(dm,
           elbow_metric='distortion',
           locate_elbow=False,
           output_path='../../outputs/'):
  
  '''Plots to choose optimal clusters.
    
    Args:
      dm (array): distance matrix
      elbow_metric (str): name of the method
      locate_elbow (boolean): True if want to return optimal number of clusters
    Returns:
      k_opt (int)(optional): optimal number of clusters according to method
  '''

  model = AgglomerativeClustering(metric = "precomputed",
                                  linkage = 'complete')
  # k is range of number of clusters.
  visualizer_ = KElbowVisualizer(model,
                                k=(2,25),
                                timings=False,
                                xlabel = 'cluster numbers',
                                metric=elbow_metric,
                                locate_elbow=locate_elbow)
  # Fit data to visualizer
  output_file = output_path+elbow_metric+'.png'
  visualizer_.fit(dm)
  # Finalize and render figure
  #visualizer_.show(output_path=output_file,clear_figure=False)
  visualizer_.show(output_path=output_file)
  k_opt=None
  if locate_elbow:
    k_opt = visualizer_.elbow_value_ 
    return k_opt

The function ‘clust’ assigns the traces to a prefixed number of clusters:

Code
def clust(clust_n,dist_matrix,df_,id2trace,patients):
  '''clusterize distance matrix.
    
    Args:
      clust_n (int): number of clusters obtained
      dist_matrix (array): distance matrix
      df_ (dataframe): event log
      id2trace (dict): patient ids as keys and their traces as values
      patients (list): patients' ids in same order as in dm
    Returns:
      df_ (dataframe): dataframe with patients and their clusters
  '''
  traces = list(id2trace[id] for id in sorted(id2trace.keys()))     


  model = AgglomerativeClustering(n_clusters=clust_n,
                                  metric = "precomputed",
                                  linkage = 'complete')
  model.fit(dist_matrix)
  labels = model.labels_

  cluster_list ={id: labels[traces.index(id2trace[id])
                            ] for id in patients}

  df_['cluster'] = [cluster_list[df_['ID'][i]] for i in range(len(df_))]
  return df_
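As a minimal illustration of the same hierarchical step, clustering a toy 4×4 distance matrix with scipy (used here instead of sklearn purely to keep the sketch self-contained):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy symmetric distance matrix: traces 0-1 and 2-3 are close pairs
dm = np.array([[0.0, 0.10, 0.90, 0.80],
               [0.10, 0.0, 0.85, 0.90],
               [0.90, 0.85, 0.0, 0.20],
               [0.80, 0.90, 0.20, 0.0]])
Z = linkage(squareform(dm), method='complete')   # complete linkage, as above
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree at 2 clusters
print(labels)
```

The two close pairs end up in separate clusters, mirroring what AgglomerativeClustering with metric="precomputed" and complete linkage would return.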

Descriptive analysis of the treatments’ event log

The implementation below shows the most frequent traces in each cluster:

Code
def make_data_dict(log,top_k,col_id):
  '''Obtain most frequent traces and their statistics
    
    Args:
      log (dataframe): event log 
      top_k (int): number of traces want to show
      col_id (str): patients id column's name in df_
    Returns:
      data_dict (dict): traces as keys and their statistics as values   
  '''

  len_id = len(set(log[col_id]))
  log_freq = pm4py.stats.get_variants(log)
  freq_list = [(t,log_freq[t],len(t)) for t in set(log_freq.keys())]
  trace = [list(t[0]) for t in sorted(freq_list, key=lambda x: 
                                              (len_id-x[1],x[2]))[:top_k]]
  cases = [t[1] for t in sorted(freq_list, key=lambda x: 
                                              (len_id-x[1],x[2]))[:top_k]]
  top_k = min(top_k,len(cases))
  percentage = [100*cases[c]/len_id for c in range(top_k)]
  cum_percentage = [sum(percentage[:p+1]) for p in range(top_k)]
  data_dict = {"Trace": trace,
               "Percentage": percentage,
               "Cases": cases,
               "Cumulative Percentage": cum_percentage}
  return data_dict
  
  
def update_color_dict(color_dict,data_dict):
  '''update of the color dict to include new events
    
    Args:
      color_dict (dict): events as keys and colors as values 
      data_dict (dict):  traces as keys and their statistics as values
    Returns:
      color_dict (dict): events as keys and colors as values    
  '''
  cmap = plt.cm.get_cmap('tab20')
  for event in set(itertools.chain.from_iterable(data_dict['Trace'])):
      if event not in color_dict and len(color_dict)==20:
         cmap = plt.cm.get_cmap('tab20b')
      if event not in color_dict:    
        try:
          color_dict.update({event:cmap(len(color_dict))})
        except:
          color_dict.update({event:cmap(2*(len(color_dict)-20))})
  return color_dict 


def trace_plotter(data_dict,
                  color_dict,
                  acronym,
                  output_file,
                  font_size=10,
                  percentage_box_width=0.8,
                  size=(15,9)):
  '''configuration of the trace_explorer plot
    
    Args:
      color_dict (dict): events as keys and colors as values 
      data_dict (dict):  traces as keys and their statistics as values
      acronym (dict): events as keys and their acronyms as values
      output_file (str): figure's path
      font_size (int): font size
      percentage_box_width (float): event boxes' width
      size (tuple): figure's size
  '''

  fig, ax = plt.subplots(figsize=size)
  percentage_position = max(len(t) for t in data_dict["Trace"]
                            ) + percentage_box_width*3 +0.5
  for row, (trace, percentage,cases,cum_percentage
            ) in enumerate(zip(data_dict["Trace"],
                               data_dict["Percentage"],
                               data_dict["Cases"],
                               data_dict["Cumulative Percentage"]),
                               start=1):
    for col, acr in enumerate(trace, start=1):
        ax.add_patch(plt.Rectangle((col - 0.5, row - 0.45), 1, 0.9,
                                    facecolor=color_dict[acr],
                                    edgecolor='white'))
        ax.text(col, 
                row, 
                acr, 
                ha='center', 
                va='center', 
                color='white',
                fontsize = font_size, 
                fontweight='bold')
        ax.add_patch(plt.Rectangle((
          percentage_position -percentage_box_width*2.5,
          row - 0.45), percentage_box_width, 0.9,
          facecolor='grey', edgecolor='white'))
        ax.text(percentage_position-percentage_box_width*2,
                row,
                f'{percentage:.2f}%',
                ha='center',
                va='center',
                color='white',
                fontsize = font_size+2)
        ax.add_patch(plt.Rectangle((
          percentage_position - percentage_box_width*1.5,
          row - 0.45), percentage_box_width, 0.9,
          facecolor='grey', edgecolor='white'))
        ax.text(percentage_position-percentage_box_width,
                row,
                f'{cases}',
                ha='center',
                va='center',
                color='white',
                fontsize = font_size+4)
        ax.add_patch(plt.Rectangle((percentage_position-percentage_box_width*0.5, 
                                    row - 0.45), percentage_box_width, 0.9,
                                    facecolor='grey', edgecolor='white'))
        ax.text(percentage_position,
                row,
                f'{cum_percentage:.2f}%', 
                ha='center',
                va='center',
                color='white',
                fontsize = font_size+2)
      
  ax.set_xlim(0.5, percentage_position+0.5)
  ax.set_xticks(range(1, int(percentage_position-1)))
  ax.set_ylabel('Traces',fontsize = font_size+3)
  ax.set_ylim(len(data_dict["Trace"]) + 0.45, 0.55) # y-axis is reversed
  ax.set_yticks([])
  ax.set_xlabel('Activities',fontsize = font_size+3)
  
  handles = [plt.Rectangle((0, 0), 0, 0, facecolor=color_dict[acr],
                            edgecolor='black', label=acronym[acr])
              for acr in acronym if acr in set(
                itertools.chain.from_iterable(data_dict['Trace']))]
  ax.legend(handles=handles, 
            bbox_to_anchor=[1.02, 1.02],
            loc='upper left',
            fontsize = font_size+6)
  for dir in ['left', 'right', 'top']:
      ax.spines[dir].set_visible(False)
  plt.tight_layout()
  plt.savefig(output_file)
  plt.close()


def trace_explorer(con,
                   cond,
                   top_k=5,
                   id_col='ID',
                   ev_col='Event',
                   date_col='date',
                   clust_col='cluster',
                   color_dict={}):
  '''Plot each clusters most frequent traces
    
    Args:
      con : db connector variable
      cond (str): predominant clinical condition's code
      top_k (int):  traces as keys and their statistics as values
      id_col (str): patients id column's name in evlog_file
      ev_col (str): events column's name in evlog_file
      date_col (str): events dates column's name in evlog_file
      clust_col (str): cluster column's name in evlog_file
      color_dict (dict): events as keys and colors as values  
  '''
  log_ = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered").df(),
                                case_id=id_col,
                                activity_key=ev_col,
                                timestamp_key=date_col)                  
  log_ = log_.sort_values([id_col,date_col])
  log_ = log_[log_['cycle']=='start'] 
  for clust in set(log_[clust_col]):
      log = log_[log_[clust_col]==clust]
      len_id = len(set(log[id_col]))
      acronym = {t:t for t in sorted(set(log[ev_col]))}
      data_dict = make_data_dict(log,top_k,id_col)
      color_dict = update_color_dict(color_dict, data_dict)
      trace_plotter(data_dict,color_dict,acronym,
                    '../../outputs/t_cluster_%s_%i.png' % (cond,clust))
  return color_dict
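For intuition, the variant counting that make_data_dict delegates to pm4py.stats.get_variants can be reproduced on a toy log with the standard library (the traces below are hypothetical):

```python
from collections import Counter

# each patient's trace as a tuple of events
traces = [('MET', 'HBA1C'), ('MET', 'HBA1C'), ('MET', 'SU'), ('MET', 'HBA1C')]
variants = Counter(traces)          # variant -> number of cases
top = variants.most_common(2)       # the top_k most frequent variants
total = len(traces)
percentage = [100 * c / total for _, c in top]
cum_percentage = [sum(percentage[:p + 1]) for p in range(len(percentage))]
print(top)             # [(('MET', 'HBA1C'), 3), (('MET', 'SU'), 1)]
print(cum_percentage)  # [75.0, 100.0]
```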

The following R functions can be used to obtain the process maps of each cluster:

Code
load_log <- function(con,
                     query,case_id="ID",
                     activity_id="Event", 
                     lifecycle_id="cycle",
                     activity_instance_id="actins",
                     timestamp="date"){
  
  eventlog <- dbGetQuery(con,query)  
  eventlog = eventlog[order(eventlog$ID),]
  #To transform date to a format we can work with
  eventlog$date = as.POSIXct(eventlog$date, tz = "", format="%Y-%m-%d",
                                   tryFormats = c("%Y-%m-%d",
                                                  "%Y/%m/%d"),
                                   optional = FALSE) 
  
  
  
  evLog = eventlog %>%
    mutate(resource = NA) %>%
    mutate(cycle = fct_recode(cycle,
                              "start" = "start",
                              "complete" = "end")) %>%
    eventlog(
      case_id = case_id,
      activity_id = activity_id, 
      lifecycle_id = lifecycle_id, 
      activity_instance_id = activity_instance_id, 
      timestamp = timestamp, 
      resource_id = 'resource'
    )
  return(evLog)
}
make_process_map <- function(log,t_freq,output_file){
  log %>%
    filter_activity_frequency(percentage = 1) %>% # keep activities covering 100% of instances
    filter_trace_frequency(percentage = t_freq) %>%
    process_map(type_nodes = performance(mean,units='days'),
                type_edges = frequency('relative_case'),
                sec_edges = performance(mean,units='days'),
                render = T) %>% 
    export_svg %>% 
    charToRaw %>%
    rsvg_png (output_file,width=2000)
}

process_map_by_cluster <- function(evLog,t_freq,cond_code){
  for (clust in unique(evLog$cluster)) {
    log <- evLog %>% 
      filter(cluster == clust)
    make_process_map(log,t_freq,gsub("src/analysis-scripts/",
                                     "",here("outputs",sprintf(
                                       "pm_cluster_%s_%d.png", 
                                       cond_code, clust) )))

  }
}

Conformance checking

Conformance checking is a technique used to assess process compliance by comparing the event log of a discovered process with an existing reference model (target model) of the same process. Based on the DM2 treatment algorithm shown previously, and using a software tool called Carassius, we created the following Petri nets, which serve as treatment guidelines with respect to glycated hemoglobin measurements.

Figure 1: DM2 Treatment Algorithm’s interpretation to patients with some CV disease in Petri Net format

Figure 2: DM2 Treatment Algorithm’s interpretation to patients with heart failure in Petri Net format

Figure 3: DM2 Treatment Algorithm’s interpretation to patients with some chronic kidney disease in Petri Net format

Figure 4: DM2 Treatment Algorithm’s interpretation to patients with frailty in Petri Net format

Figure 5: DM2 Treatment Algorithm’s interpretation to patients with obesity in Petri Net format

Figure 6: DM2 Treatment Algorithm’s interpretation to patients without predominant clinical condition in Petri Net format

Fitness is the metric that measures how far a trace deviates from a given process model, in this case the guidelines. There are different methods to calculate this metric; the code below uses alignment-based fitness. Since this metric requires a fixed initial marking and final marking, we included the events ‘INI’ and ‘FIN’ in the Petri net and in each trace. Adding these artificial events allows us to compare the fitness of all traces under the same conditions.
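A small sketch of wrapping a single toy trace with the artificial events, placed one day before its first and one day after its last real event (the events and dates here are made up):

```python
import pandas as pd
from datetime import timedelta

log = pd.DataFrame({'ID': ['p1'] * 3,
                    'Event': ['MET', 'HBA1C', 'SU'],
                    'date': pd.to_datetime(['2018-01-05', '2018-04-01',
                                            '2018-09-20'])})
# artificial start/end events just outside the trace's real time span
wrap = pd.DataFrame({'ID': ['p1', 'p1'],
                     'Event': ['INI', 'FIN'],
                     'date': [log['date'].min() - timedelta(days=1),
                              log['date'].max() + timedelta(days=1)]})
log = pd.concat([log, wrap]).sort_values('date').reset_index(drop=True)
print(list(log['Event']))  # ['INI', 'MET', 'HBA1C', 'SU', 'FIN']
```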

Code
def id2treat_fitness(log,
                     net,
                     initial_marking,
                     final_marking,
                     clust_col='cluster',
                     date_col='date',
                     ev_col='Event'):
  '''Obtain traces fitness
    
    Args:
      log (dataframe): event log 
      net: petri net
      initial_marking: initial place in the petri net
      final_marking: final place in the petri net
      clust_col (str): cluster column's name in log
      date_col (str): events dates column's name in log
      ev_col (str): events column's name in log
    Returns:
      df (dataframe): traces, their clusters and their fitnesses  
  '''
  id2trace = {id:list(log['Event'][log['case:concept:name']==id]
                      ) for id in set(log['case:concept:name'])}
  id2ids = {id:[id2 for id2 in set(id2trace.keys()) if id2trace[id]==id2trace[id2]
                ] for id in set(log['case:concept:name'])}
  for id in set(id2ids.keys()):
      try:
          for id2 in id2ids[id]:
              if id2!=id:
                  del id2ids[id2]
      except:
          continue
  
  id2fitness = dict()
  for name in set(id2ids.keys()):
      log2 = log.drop(log.index[log['case:concept:name'] !=name])
      new = log2.copy().iloc[[0, -1]]
      date_list = list(new[date_col])
      index = list(new['@@index'])
      actins = list(new['actins'])
      clust = new[clust_col].values[0]
      new[date_col] =  new['time:timestamp'] = [date_list[0]-timedelta(days=1),
                                                date_list[1]+timedelta(days=1)]
      new[ev_col] = new['concept:name'] = ['INI','FIN']
      new['@@index'] = [index[0]-1,index[1]+1]
      new['actins'] = [actins[0]-1,actins[1]+1]
    
      log2 = pd.concat([log2,new]).sort_values('time:timestamp')
      aligned_fitness = pm4py.conformance_diagnostics_alignments(
                        log2, net, initial_marking, final_marking)[0]['fitness']
      for id in id2ids[name]:
          id2fitness[id] = {'ID':id,
                            'aligned_fitness':aligned_fitness,
                            clust_col: clust}
  
  df = pd.DataFrame(id2fitness).T
  df.index = range(len(df))
  return df

The function below takes a treatments’ event log and returns each patient’s fitness per period.

Code
def id2treat_fitness_by_interval(con,
                           cond,
                           pn_file,
                           ini_place='place100',
                           fin_place='place111',
                           date_n_col='date',
                           fixed_period_time=90):
    '''Obtain traces fitness by intervals of fixed_period_time days
    
    Args:
      con: db connector variable
      cond (str): predominant clinical condition's code
      pn_file (str): petri net's path
      ini_place (str): initial place in the petri net
      fin_place (str): final place in the petri net
      date_n_col (str): events dates column's name in log
      fixed_period_time (int): number of days in each interval
  '''
    net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
    initial_marking = Marking()
    initial_marking[list(net.places)[[str(p) for p in net.places].index(
        ini_place)]] = 1
    final_marking = Marking()
    final_marking[list(net.places)[[str(p) for p in net.places].index(
        fin_place)]] = 1
    event_log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust \
    WHERE cycle = 'start'").df(),
                                case_id='ID',
                                activity_key='Event',
                                timestamp_key=date_n_col) 
    event_log = event_log[event_log['cycle']=='start'] 
    baseline = datetime.strptime('2017-01-01', "%Y-%m-%d")
    endline = datetime.strptime('2023-01-01', "%Y-%m-%d")
    dd = [baseline + timedelta(days=x) for x in range((
              endline-baseline).days + 1)]
    dd_INI = dd[0]-timedelta(days=1)
    dd_FIN = dd[-1]+timedelta(days=1)
    event_log['time:timestamp'] = event_log['time:timestamp'].dt.tz_localize(None)
    event_log[date_n_col] = event_log[date_n_col].dt.tz_localize(None)
    event_log['date_'] = event_log[date_n_col].apply(
        lambda x: dd.index(x))
    event_log[date_n_col] = event_log[date_n_col].apply(
        lambda x: dd.index(x))
    event_log[date_n_col] = event_log.groupby('ID')[date_n_col].apply(
        lambda x: x-x.min())
    
    hosp = con.sql("SELECT * FROM cmbd_incidents_postdx_first_view").df()
    hosp['admission_date'] = hosp['admission_date'].dt.tz_localize(None).apply(
        lambda x: dd.index(x))
    hosp = dict(zip(hosp.patient_id,hosp.admission_date))
    
    period2fitness = pd.DataFrame()
    id_list_df = []
    p_start = []
    p_end = []
    fitness = []
    date_0 = []
    status = []
    id_list = list(set(event_log['ID']))
    for id in tqdm(id_list):
        event_log_id = event_log[event_log['ID']==id]
        hosp_d = hosp.get(id,100000)-min(event_log_id['date_'])
        date_max = min(max(event_log_id[date_n_col]),hosp_d)
        date_min = min(event_log_id['time:timestamp'])
        n_periods = date_max//fixed_period_time
        n_periods_r = date_max/fixed_period_time
        if date_max<=fixed_period_time:
            continue
        for n in range(1,n_periods+1):
            event_log_id_n = event_log_id[
                event_log_id[date_n_col]<n*fixed_period_time]
            
            new = event_log_id.copy().iloc[[0, -1]]
            actins = list(new['actins'])
            index = list(new['@@index'])
            new['time:timestamp'] = [dd_INI,dd_FIN]
            new['concept:name'] = ['INI','FIN']
            new['@@index'] = [index[0]-1,index[1]+1]
            new['actins'] = [actins[0]-1,actins[1]+1]
            event_log_id_n = pd.concat([event_log_id_n,new]
                                       ).sort_values('time:timestamp')
            aligned_fitness = pm4py.conformance_diagnostics_alignments(
                             event_log_id_n, net, 
                             initial_marking, final_marking)[0]['fitness']
            id_list_df.append(id)
            p_start.append((n-1)*fixed_period_time)
            p_end_value = n*fixed_period_time
            if n==n_periods:
                p_end_value = date_max-fixed_period_time
            p_end.append(p_end_value)
            fitness.append(aligned_fitness)
            date_0.append(date_min)
            if (n==n_periods or n==n_periods_r-1) and hosp_d==date_max :
                status.append(1)
                break
            else:
                status.append(0)
    
    period2fitness['ID'] = id_list_df
    period2fitness['t_0'] = p_start
    period2fitness['t_1'] = p_end
    period2fitness['fitness'] = fitness
    period2fitness['ini_date'] = date_0
    period2fitness['status'] = status
    con.sql(f"DROP TABLE IF EXISTS period2fitness_{cond}")
    con.sql(f"CREATE TABLE period2fitness_{cond} AS SELECT *,\
            MAX(status) OVER (PARTITION BY ID) AS status2  FROM period2fitness \
            WHERE t_0!=t_1") 
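The interval bookkeeping can be checked in isolation; with hypothetical numbers, a 400-day follow-up cut into 90-day periods yields four complete periods:

```python
# mirror of the n_periods / p_start / p_end logic above, on toy numbers
fixed_period_time = 90
date_max = 400  # hypothetical days of follow-up for one patient
n_periods = date_max // fixed_period_time
periods = [((n - 1) * fixed_period_time, n * fixed_period_time)
           for n in range(1, n_periods + 1)]
print(periods)  # [(0, 90), (90, 180), (180, 270), (270, 360)]
```

The remaining 40 days do not form a complete period and are not scored.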

The next function draws a boxplot of each cluster’s fitness distribution.

Code
def treatments_clusters_boxplot(con,
                cond,
                pn_file,
                pn_png_file,
                ini_place='place100',
                fin_place='place111',
                date_col='date',
                ev_col='Event',
                clust_col='cluster',
                output_png='../../outputs/fitness_by_cluster.png'):
  '''Boxplot of the fitness of each cluster
    
    Args:
      con: db connector variable
      cond (str): predominant clinical condition
      pn_file (str): petri net's path
      pn_png_file (str): saved petri net figure's path
      ini_place (str): initial place in the petri net
      fin_place (str): final place in the petri net
      date_col (str): events dates column's name in log
      clust_col (str): cluster column's name in log
      ev_col (str): events column's name in log
      output_png (str): created figure's path
  '''
  log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered \
  WHERE cycle = 'start'").df(),
                                case_id='ID',
                                activity_key=ev_col,
                                timestamp_key=date_col)                  

  log = log.sort_values(date_col)
  log.index = range(len(log))
  net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
  initial_marking = Marking()
  initial_marking[list(net.places)[[str(p) for p in net.places].index(
    ini_place)]] = 1
  final_marking = Marking()
  final_marking[list(net.places)[[str(p) for p in net.places].index(
    fin_place)]] = 1
  pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
  df = id2treat_fitness(log,net,initial_marking,final_marking,
                  clust_col,date_col,ev_col)
  df['aligned_fitness'] = pd.to_numeric(df['aligned_fitness'])
  df['cluster'] = pd.to_numeric(df['cluster'])
  df['ID'] = df['ID'].astype("string")
  con.sql(f"DROP TABLE IF EXISTS eventlog_{cond}_byclust")                
  con.sql(f"CREATE TABLE eventlog_{cond}_byclust AS \
  SELECT * FROM df")
  data = [list(df['aligned_fitness'][df[clust_col]==i])
          for i in sorted(set(df[clust_col]))]
   
  fig = plt.figure(figsize =(10, 7))
  # Creating axes instance
  ax = fig.add_axes([0, 0, 1, 1])
  # Creating plot
  bp = ax.boxplot(data,labels=[i for i in sorted(set(df[clust_col]))])
  plt.xlabel("Clusters")
  plt.ylabel("Aligned Fitness")
  # show plot
  plt.savefig(output_png,bbox_inches='tight')
  plt.close(fig)
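A self-contained sketch of just the plotting step, with made-up fitness values and off-screen rendering (the output filename is illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt

# toy per-cluster aligned-fitness values
data = [[0.7, 0.8, 0.9], [0.4, 0.5, 0.6], [0.85, 0.9, 0.95]]
fig, ax = plt.subplots(figsize=(10, 7))
ax.boxplot(data)
ax.set_xticklabels(['0', '1', '2'])
ax.set_xlabel("Clusters")
ax.set_ylabel("Aligned Fitness")
fig.savefig('fitness_by_cluster_demo.png', bbox_inches='tight')
plt.close(fig)
```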
  

Process indicators can also be represented in a Petri net. We therefore created the following Petri nets to analyze whether the traces follow the guidelines during the first five years:

Figure 7: Petri Net of process indicators of the first year

Figure 8: Petri Net of process indicators of the first two years

Figure 9: Petri Net of process indicators of the first three years

Figure 10: Petri Net of process indicators of the first four years

Figure 11: Petri Net of process indicators of the first five years

The function below takes a process indicators’ event log and returns each patient’s fitness by year. In this case the fitness is calculated with the token-based replay method.

Code
def id2process_fitness(con,
                       query,
                       pn_file,
                       pn_png_file,
                       ini_place='place37',
                       fin_place='place148',
                       id_col = 'patient_id',
                       event_col = 'event',
                       date_col='date'):
    '''Calculate the fitness of a given process indicators' event log
    
    Args:
      con: db connector variable
      query (str): query to select event log
      pn_file (str): petri net's path
      pn_png_file (str): saved petri net figure's path
      ini_place (str): initial place in the petri net
      fin_place (str): final place in the petri net
      id_col (str): ids' column's name in event log
      event_col (str): events' column's name in event log
      date_col (str): events' dates' column's name in event log
    Returns:
      df (dataframe): dataframe of ids' fitness
  '''
    net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
    # Override the markings read from the file with the given
    # initial and final places
    initial_marking = Marking()
    initial_marking[list(net.places)[[str(p) for p in net.places].index(
        ini_place)]] = 1
    final_marking = Marking()
    final_marking[list(net.places)[[str(p) for p in net.places].index(
        fin_place)]] = 1
    pm4py.save_vis_petri_net(net, initial_marking, final_marking, pn_png_file)
    event_log = pm4py.format_dataframe(con.sql(query).df(),
                                       case_id=id_col,
                                       activity_key=event_col,
                                       timestamp_key=date_col)
    date_list = []
    fit_list = []
    id_list = list(set(event_log[id_col]))
    for id in tqdm(id_list):
        log = event_log.drop(event_log.index[event_log['case:concept:name'] != id]
                             ).sort_values('time:timestamp')
        fit_obj = pm4py.conformance.conformance_diagnostics_token_based_replay(
            log, net, initial_marking, final_marking)
        # Second-to-last timestamp: the last real event before the closing FIN
        date_list.append(list(log['time:timestamp'])[-2])
        fit_list.append(fit_obj[0]['trace_fitness'])

    # Baseline fitness (alfa): fitness of a trace containing only the
    # structural INI/FIN/yFIN events (taken from the last processed trace),
    # used below to rescale patients' fitness
    log_0 = log[log['concept:name'].isin(['INI', 'FIN', 'yFIN'])]
    alfa = pm4py.conformance.conformance_diagnostics_token_based_replay(
        log_0, net, initial_marking, final_marking)[0]['trace_fitness']

    df = pd.DataFrame()
    df['ID'] = id_list
    df['fitness'] = [(x - alfa) / (1 - alfa) for x in fit_list]
    df['date'] = date_list
    return df

Decision mining

Decision mining analyzes the event transitions at different points of a process. In other words, we can measure which patient characteristics are relevant at a specific place of the given Petri net. The next function builds a decision tree explaining how patients' characteristics are taken into account at that place. It also produces a bar plot of each feature's importance in the decision tree, obtained with mean decrease in impurity (MDI), which scores a feature by summing the impurity reduction over all splits (across all trees) that use it, weighted by the proportion of samples each split handles.

Code
def decision_tree_pn_place(con,
                           cond,
                           features_list = ['age','sex','copayment'],
                           pn_file="./PN.pnml",
                           place2analyze='place9',
                           ini_place='place100',
                           fin_place='place111'):
  '''Decision tree and features importances in selected place of PN
    
    Args:
      con : db connector variable
      cond (str): predominant clinical condition's code
      features_list (list): features wanted to analyze
      pn_file (str): petri net's path
      place2analyze (str): place wanted to analyze in petri net
      ini_place (str): initial place in the petri net
      fin_place (str): final place in the petri net
      
  '''                           
  log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond} WHERE \
  ID in (SELECT ID FROM eventlog_{cond}_clust)").df(), case_id='ID',
                               activity_key='Event',timestamp_key='date')  
  log = log[log['cycle']=='start']    
  log = log[['concept:name','time:timestamp','case:concept:name']]
  ini = min(list(log['time:timestamp']))-timedelta(days=1)
  fin = max(list(log['time:timestamp']))+timedelta(days=1)
  patients = list(set(log['case:concept:name']))
  add_ini_fin = {'case:concept:name' : patients*2,
                   'time:timestamp':[ini]*len(patients)+[fin]*len(patients),
                   'concept:name':['INI']*len(patients)+['FIN']*len(patients)}
  log = pd.concat([log,pd.DataFrame(add_ini_fin)])
        
  log = log.sort_values(['case:concept:name','time:timestamp'])
  log.index = range(len(log))
  patients_df = con.sql("SELECT patient_id,age,sex,copayment FROM patient_incidents_view").df()
  patients_df = patients_df.fillna(2)
  patients_df['copayment'] = patients_df['copayment'].values.astype(int).astype(str)
  log_patients = log.merge(patients_df,left_on='case:concept:name',
                             right_on='patient_id',how='left')
  log_patients = log_patients[['concept:name', 'time:timestamp',
                                 'case:concept:name', 'age',
                                 'sex','copayment']]
  net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
    
  initial_marking = Marking()
  initial_marking[list(net.places)[[str(p) 
                    for p in net.places].index(ini_place)]] = 1
  final_marking = Marking()
  final_marking[list(net.places)[[str(p) 
                  for p in net.places].index(fin_place)]] = 1
  try:
    X, y, class_names = decision_mining.apply(log_patients,
                                              net,
                                              initial_marking,
                                              final_marking,
                                              decision_point=place2analyze)
    clf, feature_names, classes = decision_mining.get_decision_tree(log_patients,
                                                    net,
                                                    initial_marking,
                                                    final_marking, 
                                                    decision_point=place2analyze)
    gviz = tree_visualizer.apply(clf, feature_names, classes)
    gviz.save(filename='decision_tree_%s' % cond,
                directory='../../outputs')              
    tree_visualizer.view(gviz)
      
    importances = clf.feature_importances_
    tree_importances = pd.Series(importances, index=feature_names)
    
    fig, ax = plt.subplots()
    tree_importances.plot.bar(ax=ax)
    ax.set_title("Feature importances using MDI")
    ax.set_ylabel("Mean decrease in impurity")
    fig.tight_layout()
    fig.savefig('../../outputs/barplot_features_importance_%s.png' % cond)
    plt.close(fig)
  except ValueError as err:
    logging.basicConfig(level=logging.INFO)
    logging.info('Error in decision mining: %s', str(err))
    if str(err) != 'No objects to concatenate':
      raise  # re-raise unexpected errors

Time dependent Cox model

The link between guideline adherence, measured through performed process indicators, and clinical outcomes is a highly demanded issue in diabetes care. A Cox model is a statistical technique for exploring the relationship between a patient's survival and several explanatory variables. One of the strengths of the Cox model is its ability to encompass covariates that change over time, such as treatment adherence. A time-dependent Cox model can be built using each patient's trace fitness at the different time intervals.
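As a sketch of the required data preparation, the yearly fitness values (like those returned by `id2process_fitness`, here reshaped to one row per patient and year) can be expanded into counting-process intervals suitable for a time-dependent Cox fit. The column names `year`, `event_year`, and the helper `fitness_to_intervals` are illustrative, not part of the pipeline:

```python
import pandas as pd

def fitness_to_intervals(fit_df, event_df):
    """Expand yearly fitness into (start, stop] counting-process rows.

    fit_df:   columns ID, year, fitness (one row per patient-year)
    event_df: columns ID, event_year (years since diagnosis at which
              the event occurred, NaN if censored)
    Hypothetical column names; adapt to the real tables.
    """
    df = fit_df.sort_values(['ID', 'year']).copy()
    df['start'] = df.groupby('ID').cumcount()  # years since diagnosis
    df['stop'] = df['start'] + 1
    df = df.merge(event_df, on='ID', how='left')
    # The event flag is 1 only on the interval in which the event occurs
    df['event'] = (df['stop'] == df['event_year']).astype(int)
    # Drop intervals after the event; keep all intervals for censored patients
    df = df[df['event_year'].isna() | (df['start'] < df['event_year'])]
    return df[['ID', 'start', 'stop', 'fitness', 'event']]

# A time-dependent Cox model could then be fitted, for instance with
# lifelines (assuming it is available):
# from lifelines import CoxTimeVaryingFitter
# ctv = CoxTimeVaryingFitter()
# ctv.fit(long_df, id_col='ID', start_col='start', stop_col='stop',
#         event_col='event')
```

In this long format, fitness is constant within each one-year interval, which is exactly how the yearly fitness measure enters the model as a time-varying covariate.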

Joint Latent Class Model

Joint models are used to analyse simultaneously two related phenomena: the evolution of a variable over time and the occurrence of an event. Joint latent class models (JLCM) consist of a linear mixed model and a proportional hazards model linked by latent classes. The population is split into several groups, the latent classes, and each class is characterized by a specific evolution of the dependent variable and an associated risk of the event. Using fitness as a time-dependent measure of treatment adherence, we can build a joint latent class model.
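To make the structure concrete, the three submodels of a JLCM can be written as follows (standard notation; the covariate vectors $X$, $Z$, $W$ are illustrative, not tied to this pipeline's variables), for patient $i$ in latent class $g = 1, \dots, G$:

```latex
% (1) Class membership: multinomial logistic regression on covariates X_i
P(c_i = g) = \frac{\exp(\xi_{0g} + X_i^{\top}\xi_{1g})}
                  {\sum_{l=1}^{G}\exp(\xi_{0l} + X_i^{\top}\xi_{1l})}

% (2) Class-specific linear mixed model for the longitudinal marker Y_{ij}
% (here, the trace fitness measured at time t_{ij})
Y_{ij} \mid c_i = g \;=\; X_{ij}^{\top}\beta_g + Z_{ij}^{\top}u_{ig} + \epsilon_{ij}

% (3) Class-specific proportional hazards model for the event
\lambda_i(t \mid c_i = g) \;=\; \lambda_{0g}(t)\,\exp(W_i^{\top}\delta_g)
```

The latent class $c_i$ is the only link between the longitudinal and survival submodels: conditionally on class membership, the fitness trajectory and the time to event are assumed independent.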

Results

Cohort description

Code
###INCIDENTS
patient_inc <- dbGetQuery(con,
                       "SELECT *,
                       CASE WHEN sex = 0 THEN 'Male' 
                       ELSE 'Female' END AS sexo FROM patient_incidents_view") 
patient_inc <- patient_inc %>%
  left_join( country2cont %>% select(country, CC), by = "country") %>%
  mutate(education=factor(education, levels=c(0,1,2,3),
      labels=c("Without studies","Primary school",
               "High school","University")),
      copayment=factor(copayment,levels=c(0,1),
                       labels=c("less than 18000","more or equal than 18000")))
patient_inc$CC[patient_inc$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")  

patient_inc%>%summary_factorlist(dependent,
                              explanatory,
                              total_col = TRUE,
                              cont="mean",
                              cont_cut=7,
                              na_include = TRUE,
                              na_to_prop =  FALSE, 
                              include_col_totals_percent    =FALSE,
                              add_col_totals = TRUE)-> patient_inc_tab

kable(patient_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 1: Demographic and socioeconomic characteristics of incident patients
label levels Female Male Total
Total N 4778 6418 11196
age Mean (SD) 61.5 (15.4) 57.9 (13.1) 59.4 (14.2)
age_dx Mean (SD) 64.5 (15.2) 61.0 (13.0) 62.5 (14.1)
CC AF 294 (6.4) 401 (6.3) 695 (6.4)
AS 35 (0.8) 63 (1.0) 98 (0.9)
EU 164 (3.6) 226 (3.6) 390 (3.6)
OC 1 (0.0) 1 (0.0) 2 (0.0)
SA 433 (9.4) 318 (5.0) 751 (6.9)
Spain 3665 (79.8) 5332 (84.1) 8997 (82.3)
copayment less than 18000 3861 (81.5) 4558 (71.3) 8419 (75.7)
more or equal than 18000 875 (18.5) 1832 (28.7) 2707 (24.3)
(Missing) 42 28 70
education Without studies 875 (22.6) 886 (16.6) 1761 (19.1)
Primary school 990 (25.6) 1332 (25.0) 2322 (25.2)
High school 1664 (43.0) 2601 (48.8) 4265 (46.4)
University 342 (8.8) 510 (9.6) 852 (9.3)
(Missing) 907 1089 1996
avg_income Mean (SD) 14377.1 (2579.1) 14359.5 (2490.1) 14366.8 (2527.2)
Code
param_inc <- dbGetQuery(con,"SELECT * FROM param_incidents_predx_last_view")
param_cat_inc <- dbGetQuery(con,"SELECT * FROM param_cat_incidents_predx_last_view")
dependent="sexo"


param_inc2unnest <- param_inc %>% 
  pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
  left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id")

param_cat_inc$param_cat_value = as.numeric(as.character(param_cat_inc$param_cat_value))
param_cat_inc2unnest <- param_cat_inc %>% 
  pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value)%>%
  left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id") %>%
  mutate(physical_activity=factor(physical_activity, levels=c(0,1,2,3),
                                  labels=c("Incapacity", "Inactive", "Partially Active","Active")),
         smoking_status=factor(smoking_status, levels=c(0,1,2,3),
                               labels=c("Non Smoker", "Ex-smoker < 1 year", "Ex-smoker >= 1 year","Smoker")),
         alcohol=factor(alcohol, levels=c(0,1,2),
                        labels=c("Abstinent", "Moderate Drinker","Heavy Drinker")),
         vaccination_covid=factor(vaccination_covid, levels=c(0,1,2,3,4,5,6,7),
                                  labels=c("No", "Yes, BioNTech-Pfizer","Yes, Moderna","Yes, Astrazeneca","Yes, Johnson-Johnson","Yes, Novavax","Yes, Other","Yes, Unknown type")),
         vaccination_flu=factor(vaccination_flu, levels=c(0,1), labels=c("No", "Yes")),
         working_status=factor(working_status,levels=c(0,1,2),labels=c('Active','Unemployed','Pensionist')))

param_inc_tab <- rbind(
  param_inc2unnest %>%
    summary_factorlist(dependent ,
                       c("bmi","height","weight","sbp","dbp","hba1c","col",
                         "hdl","ldl","trgl","creat","indalbcr","alb"),
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,add_row_totals = TRUE,
                       include_row_missing_col = TRUE,
                       include_col_totals_percent   =FALSE,add_col_totals = TRUE),
  param_cat_inc2unnest %>%
    summary_factorlist(dependent ,
                       c("physical_activity","smoking_status","alcohol","vaccination_flu",'working_status'),
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,cont_cut=8,add_row_totals = TRUE,
                       include_row_missing_col = TRUE))
kable(param_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 2: Clinical and lifestyle characteristics of incident patients
label Total N Missing N levels Female Male Total
Total N 4320 5795 10115
bmi 6207 (61.4) 3908 Mean (SD) 32.8 (12.6) 32.1 (9.8) 32.4 (11.1)
height 6306 (62.3) 3809 Mean (SD) 156.0 (12.7) 169.4 (15.2) 163.6 (15.7)
weight 6726 (66.5) 3389 Mean (SD) 79.5 (18.1) 92.2 (18.6) 86.8 (19.4)
sbp 8283 (81.9) 1832 Mean (SD) 134.7 (19.9) 136.7 (19.6) 135.8 (19.7)
dbp 8275 (81.8) 1840 Mean (SD) 79.1 (11.4) 81.6 (12.1) 80.5 (11.9)
hba1c 7573 (74.9) 2542 Mean (SD) 6.9 (1.4) 7.2 (1.7) 7.1 (1.6)
col 9414 (93.1) 701 Mean (SD) 204.9 (43.3) 196.5 (47.9) 200.1 (46.2)
hdl 9338 (92.3) 777 Mean (SD) 48.1 (12.2) 41.7 (10.9) 44.5 (11.9)
ldl 8629 (85.3) 1486 Mean (SD) 124.8 (35.2) 118.4 (38.5) 121.3 (37.2)
trgl 9375 (92.7) 740 Mean (SD) 163.6 (114.6) 200.8 (221.1) 184.7 (183.7)
creat 9559 (94.5) 556 Mean (SD) 0.8 (0.3) 1.0 (0.4) 0.9 (0.3)
indalbcr 7266 (71.8) 2849 Mean (SD) 29.0 (117.7) 52.1 (286.6) 42.0 (229.2)
alb 4332 (42.8) 5783 Mean (SD) 26.3 (18.5) 26.4 (18.6) 26.3 (18.6)
physical_activity 3659 (36.5) 6363 Incapacity 31 (1.9) 24 (1.2) 55 (1.5)
Inactive 265 (16.1) 228 (11.3) 493 (13.5)
Partially Active 678 (41.3) 784 (38.9) 1462 (40.0)
Active 669 (40.7) 980 (48.6) 1649 (45.1)
(Missing) 2456 3907 6363
smoking_status 4604 (45.9) 5418 Non Smoker 1368 (69.5) 973 (36.9) 2341 (50.8)
Ex-smoker < 1 year 46 (2.3) 136 (5.2) 182 (4.0)
Ex-smoker >= 1 year 240 (12.2) 807 (30.6) 1047 (22.7)
Smoker 315 (16.0) 719 (27.3) 1034 (22.5)
(Missing) 2130 3288 5418
alcohol 4003 (39.9) 6019 Abstinent 1531 (85.9) 1063 (47.9) 2594 (64.8)
Moderate Drinker 239 (13.4) 1043 (47.0) 1282 (32.0)
Heavy Drinker 12 (0.7) 115 (5.2) 127 (3.2)
(Missing) 2317 3702 6019
vaccination_flu 3691 (36.8) 6331 No 0 (0.0) 0 (0.0) 0 (0.0)
Yes 1799 (100.0) 1892 (100.0) 3691 (100.0)
(Missing) 2300 4031 6331
working_status 8716 (87.0) 1306 Active 1238 (37.1) 2222 (41.3) 3460 (39.7)
Unemployed 0 (0.0) 0 (0.0) 0 (0.0)
Pensionist 2101 (62.9) 3155 (58.7) 5256 (60.3)
(Missing) 760 546 1306
Code
patient_inc$deregistration_date <- patient_inc$deregistration_date %>%
  replace_na(as.Date('2023-01-01'))
patient_inc <- patient_inc %>% 
  mutate(follow_up_time=as.numeric(difftime(deregistration_date,
                                            dx_date,units="days")/365.25)) %>%
  filter(follow_up_time!=0)

ss_use_inc <- dbGetQuery(con," SELECT * FROM main.ss_use \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01') AND \
                  visit_date>= (SELECT dx_date FROM main.patient WHERE \
                  main.patient.patient_id = main.ss_use.patient_id)") %>% mutate(
                        visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
                        labels=c("PC_physician", "PC_nurse", "PC_social_worker",
                                 "PC_emergency_service","PC_others",
                                 "Specialized_visit_physician",
                                 "Specialized_visit_nurse",
                                 "Specialized_visit_unknown_professional")))

ss_use_inc1 <- ss_use_inc %>% 
  filter(visit_loc == 1)
use_inc_freq1 <- as.data.frame.matrix(table(ss_use_inc1$patient_id,ss_use_inc1$visit_type))
use_inc_freq1 <- cbind(patient_id = rownames(use_inc_freq1), use_inc_freq1)
use_inc_freq1_time <- left_join(use_inc_freq1, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

ss_use_inc2 <- ss_use_inc %>% 
  filter(visit_loc == 2)
use_inc_freq2 <- as.data.frame.matrix(table(ss_use_inc2$patient_id,ss_use_inc2$visit_type))
use_inc_freq2 <- cbind(patient_id = rownames(use_inc_freq2), use_inc_freq2)
use_inc_freq2_time <- left_join(use_inc_freq2, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

ss_use_inc3 <- ss_use_inc %>% 
  filter(visit_loc == 3)
use_inc_freq3 <- as.data.frame.matrix(table(ss_use_inc3$patient_id,ss_use_inc3$visit_type))
use_inc_freq3 <- cbind(patient_id = rownames(use_inc_freq3), use_inc_freq3)
use_inc_freq3_time <- left_join(use_inc_freq3, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 


use_inc_tab <- rbind(
  rbind(data.frame(label = "at care center", levels = '',
                   Male = '', Female = '', Total = ''),
        use_inc_freq1_time %>%
          summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                          "PC_emergency_service","PC_others",
                                          "Specialized_visit_physician",
                                          "Specialized_visit_nurse",
                                          "Specialized_visit_unknown_professional"),
                             total_col = TRUE,cont="mean",cont_cut=1,
                             na_include = TRUE,na_to_prop =  FALSE)),
  rbind(data.frame(label = "at home", levels = '',
                   Male = '', Female = '', Total = ''),
        use_inc_freq2_time %>%
          summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                          "PC_emergency_service","PC_others",
                                          "Specialized_visit_physician",
                                          "Specialized_visit_nurse",
                                          "Specialized_visit_unknown_professional"),
                             total_col = TRUE,cont="mean",cont_cut=1,
                             na_include = TRUE,na_to_prop =  FALSE)),
  rbind(data.frame(label = "by telephone or similar", levels = '',
                   Male = '', Female = '', Total = ''),
        use_inc_freq3_time %>%
          summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                          "PC_emergency_service","PC_others",
                                          "Specialized_visit_physician",
                                          "Specialized_visit_nurse",
                                          "Specialized_visit_unknown_professional"),
                             total_col = TRUE,cont="mean",cont_cut=1,
                             na_include = TRUE,na_to_prop =  FALSE)))

kable(use_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 3: Use of Primary Care Services per year of incident patients
label levels Male Female Total
at care center
PC_physician Mean (SD) 5.6 (8.2) 6.1 (8.2) 5.8 (8.2)
PC_nurse Mean (SD) 9.4 (12.9) 8.9 (8.7) 9.2 (11.3)
PC_social_worker Mean (SD) 0.2 (1.1) 0.3 (1.5) 0.3 (1.3)
PC_emergency_service Mean (SD) 1.5 (10.5) 1.6 (5.9) 1.5 (8.8)
PC_others Mean (SD) 0.0 (1.2) 0.0 (0.2) 0.0 (0.9)
Specialized_visit_physician Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_nurse Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_unknown_professional Mean (SD) 7.8 (24.9) 6.5 (16.8) 7.2 (21.8)
at home
PC_physician Mean (SD) 2.9 (8.3) 3.4 (9.8) 3.2 (9.1)
PC_nurse Mean (SD) 5.0 (14.9) 6.1 (17.5) 5.6 (16.4)
PC_social_worker Mean (SD) 0.1 (0.4) 0.1 (0.5) 0.1 (0.5)
PC_emergency_service Mean (SD) 0.2 (1.2) 0.2 (1.7) 0.2 (1.5)
PC_others Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_physician Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_nurse Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_unknown_professional Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
by telephone or similar
PC_physician Mean (SD) 4.8 (9.4) 5.5 (8.9) 5.1 (9.1)
PC_nurse Mean (SD) 2.9 (6.8) 3.4 (6.4) 3.1 (6.6)
PC_social_worker Mean (SD) 0.3 (2.7) 0.5 (2.1) 0.4 (2.4)
PC_emergency_service Mean (SD) 0.2 (1.9) 0.3 (1.1) 0.3 (1.6)
PC_others Mean (SD) 0.0 (0.4) 0.0 (0.9) 0.0 (0.7)
Specialized_visit_physician Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_nurse Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_unknown_professional Mean (SD) 0.1 (1.3) 0.1 (1.1) 0.1 (1.2)
Code
patient_pre <- dbGetQuery(con,
                          "SELECT *,
                       CASE WHEN sex = 0 THEN 'Male' 
                       ELSE 'Female' END AS sexo,
                       FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25) 
                       AS 'age',
                       FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25) 
                       AS 'age_dx' FROM main.patient 
                       WHERE dx_date < '2017-01-01'") 
patient_pre <- patient_pre %>%
  left_join( country2cont %>% select(country, CC), by = "country") %>%
  mutate(education=factor(education, levels=c(0,1,2,3),
      labels=c("Without studies","Primary school",
               "High school","University")),
      copayment=factor(copayment,levels=c(0,1),
                       labels=c("less than 18000","more or equal than 18000")))
patient_pre$CC[patient_pre$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")  

patient_pre%>%summary_factorlist(dependent,
                                 explanatory,
                                 total_col = TRUE,
                                 cont="mean",
                                 cont_cut=7,
                                 na_include = TRUE,
                                 na_to_prop =  FALSE, 
                                 include_col_totals_percent =FALSE,
                                 add_col_totals = TRUE)-> patient_pre_tab
kable(patient_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 4: Demographic and socioeconomic characteristics of prevalent patients
label levels Female Male Total
Total N 15454 20496 35950
age Mean (SD) 72.4 (13.2) 67.7 (12.1) 69.7 (12.8)
age_dx Mean (SD) 62.3 (13.7) 58.3 (12.4) 60.0 (13.1)
CC AF 235 (1.5) 371 (1.8) 606 (1.7)
AS 39 (0.3) 56 (0.3) 95 (0.3)
EU 248 (1.6) 356 (1.7) 604 (1.7)
SA 440 (2.9) 315 (1.5) 755 (2.1)
Spain 14354 (93.7) 19314 (94.6) 33668 (94.2)
OC 4 (0.0) 4 (0.0)
copayment less than 18000 12174 (79.2) 12372 (60.5) 24546 (68.5)
more or equal than 18000 3197 (20.8) 8065 (39.5) 11262 (31.5)
(Missing) 83 59 142
education Without studies 3323 (30.5) 3105 (20.5) 6428 (24.7)
Primary school 3386 (31.1) 4660 (30.8) 8046 (30.9)
High school 3628 (33.3) 6155 (40.7) 9783 (37.6)
University 543 (5.0) 1198 (7.9) 1741 (6.7)
(Missing) 4574 5378 9952
avg_income Mean (SD) 14371.0 (2500.8) 14649.2 (2611.4) 14532.5 (2569.2)
Code
param_pre <- dbGetQuery(con,"SELECT * FROM param_prevalents_pre_last_view")
param_cat_pre <- dbGetQuery(con,"SELECT * FROM param_cat_prevalents_pre_last_view")
dependent="sexo"


param_pre2unnest <- param_pre %>% 
  pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
  left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id")

param_cat_pre$param_cat_value = as.numeric(as.character(param_cat_pre$param_cat_value))
param_cat_pre2unnest <- param_cat_pre %>% 
  pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value)%>%
  left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id") %>%
  mutate(physical_activity=factor(physical_activity, levels=c(0,1,2,3),
                                  labels=c("Incapacity", "Inactive", "Partially Active","Active")),
         smoking_status=factor(smoking_status, levels=c(0,1,2,3),
                               labels=c("Non Smoker", "Ex-smoker < 1 year", "Ex-smoker >= 1 year","Smoker")),
         alcohol=factor(alcohol, levels=c(0,1,2),
                        labels=c("Abstinent", "Moderate Drinker","Heavy Drinker")),
         vaccination_covid=factor(vaccination_covid, levels=c(0,1,2,3,4,5,6,7),
                                  labels=c("No", "Yes, BioNTech-Pfizer","Yes, Moderna","Yes, Astrazeneca","Yes, Johnson-Johnson","Yes, Novavax","Yes, Other","Yes, Unknown type")),
         vaccination_flu=factor(vaccination_flu, levels=c(0,1), labels=c("No", "Yes")),
         working_status=factor(working_status,levels=c(0,1,2),labels=c('Active','Unemployed','Pensionist')))

param_pre_tab <- rbind(
  param_pre2unnest %>%
    summary_factorlist(dependent ,
                       c("bmi","height","weight","sbp","dbp","hba1c","col",
                         "hdl","ldl","trgl","creat","indalbcr","alb"),
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,add_row_totals = TRUE,
                       include_row_missing_col = TRUE,
                       include_col_totals_percent   =FALSE,add_col_totals = TRUE),
  param_cat_pre2unnest %>%
    summary_factorlist(dependent ,
                       c("physical_activity","smoking_status","alcohol","vaccination_flu",'working_status'),
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,cont_cut=8,add_row_totals = TRUE,
                       include_row_missing_col = TRUE))
kable(param_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 5: Clinical and lifestyle characteristics of prevalent patients
label Total N Missing N levels Female Male Total
Total N 13234 16902 30136
bmi 17036 (56.5) 13100 Mean (SD) 31.3 (30.7) 31.4 (97.6) 31.3 (76.6)
height 17199 (57.1) 12937 Mean (SD) 154.3 (11.5) 167.6 (14.1) 161.9 (14.6)
weight 19419 (64.4) 10717 Mean (SD) 79.0 (501.4) 85.5 (15.5) 82.7 (328.5)
sbp 25078 (83.2) 5058 Mean (SD) 135.4 (18.0) 135.0 (16.9) 135.2 (17.4)
dbp 25043 (83.1) 5093 Mean (SD) 75.5 (10.6) 76.4 (10.7) 76.0 (10.7)
hba1c 24086 (79.9) 6050 Mean (SD) 7.0 (1.2) 7.0 (1.2) 7.0 (1.2)
col 25567 (84.8) 4569 Mean (SD) 184.4 (38.6) 171.4 (40.9) 177.1 (40.4)
hdl 25438 (84.4) 4698 Mean (SD) 48.2 (12.5) 43.1 (11.4) 45.3 (12.2)
ldl 23910 (79.3) 6226 Mean (SD) 108.1 (31.9) 100.6 (31.3) 104.0 (31.8)
trgl 25474 (84.5) 4662 Mean (SD) 146.9 (85.4) 147.9 (115.5) 147.5 (103.4)
creat 24849 (82.5) 5287 Mean (SD) 0.9 (1.5) 1.1 (2.0) 1.0 (1.8)
indalbcr 21275 (70.6) 8861 Mean (SD) 52.2 (257.2) 71.3 (293.3) 63.2 (278.7)
alb 4346 (14.4) 25790 Mean (SD) 4.1 (0.4) 4.2 (1.0) 4.2 (0.7)
physical_activity 9543 (30.7) 21573 Incapacity 82 (1.9) 49 (0.9) 131 (1.4)
Inactive 607 (14.4) 410 (7.7) 1017 (10.7)
Partially Active 1837 (43.6) 1820 (34.2) 3657 (38.3)
Active 1690 (40.1) 3048 (57.2) 4738 (49.6)
(Missing) 8811 12762 21573
smoking_status 12679 (40.7) 18437 Non Smoker 4355 (81.5) 2970 (40.5) 7325 (57.8)
Ex-smoker < 1 year 57 (1.1) 224 (3.1) 281 (2.2)
Ex-smoker >= 1 year 438 (8.2) 2571 (35.0) 3009 (23.7)
Smoker 492 (9.2) 1572 (21.4) 2064 (16.3)
(Missing) 7685 10752 18437
alcohol 12184 (39.2) 18932 Abstinent 4790 (89.9) 3268 (47.7) 8058 (66.1)
Moderate Drinker 527 (9.9) 3337 (48.7) 3864 (31.7)
Heavy Drinker 11 (0.2) 251 (3.7) 262 (2.2)
(Missing) 7699 11233 18932
vaccination_flu 18067 (58.1) 13049 No 0 (0.0) 0 (0.0) 0 (0.0)
Yes 8229 (100.0) 9838 (100.0) 18067 (100.0)
(Missing) 4798 8251 13049
working_status 24421 (78.5) 6695 Active 1446 (15.4) 2927 (19.4) 4373 (17.9)
Unemployed 0 (0.0) 0 (0.0) 0 (0.0)
Pensionist 7922 (84.6) 12126 (80.6) 20048 (82.1)
(Missing) 3659 3036 6695
Code
patient_pre$deregistration_date <- patient_pre$deregistration_date %>%
  replace_na(as.Date('2023-01-01'))
patient_pre <- patient_pre %>% 
  mutate(follow_up_time=as.numeric(difftime(deregistration_date,
                                            as.Date('2016-12-31'),units="days")/365.25))%>%
  filter(follow_up_time!=0)



ss_use_pre <- dbGetQuery(con," SELECT * FROM main.ss_use \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date <'2017-01-01') AND
                         visit_date>='2017-01-01'") %>% mutate(
                           visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
                          labels=c("PC_physician", "PC_nurse", "PC_social_worker",
                                   "PC_emergency_service","PC_others",
                                   "Specialized_visit_physician",
                                   "Specialized_visit_nurse",
                                   "Specialized_visit_unknown_professional"))
                         )

ss_use_pre1 <- ss_use_pre %>% 
  filter(visit_loc == 1)
use_pre_freq1 <- as.data.frame.matrix(table(ss_use_pre1$patient_id,ss_use_pre1$visit_type))
use_pre_freq1 <- cbind(patient_id = rownames(use_pre_freq1), use_pre_freq1)
use_pre_freq1_time <- left_join(use_pre_freq1, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

# Helper to avoid repeating the frequency/rate computation per visit location
visit_cols <- c("PC_physician", "PC_nurse", "PC_social_worker",
                "PC_emergency_service", "PC_others",
                "Specialized_visit_physician", "Specialized_visit_nurse",
                "Specialized_visit_unknown_professional")

use_pre_rates <- function(loc) {
  ss <- ss_use_pre %>% filter(visit_loc == loc)
  freq <- as.data.frame.matrix(table(ss$patient_id, ss$visit_type))
  freq <- cbind(patient_id = rownames(freq), freq)
  left_join(freq,
            patient_pre %>% select(patient_id, sex, follow_up_time, dx_date),
            by = "patient_id") %>%
    mutate(across(all_of(visit_cols), ~ .x / follow_up_time),
           sexo = factor(sex, levels = c(0, 1), labels = c("Male", "Female")))
}

use_pre_freq2_time <- use_pre_rates(2)
use_pre_freq3_time <- use_pre_rates(3)


use_pre_tab <- rbind(
  rbind(data.frame(label = "at care center", levels = '',
                   Male = '', Female = '', Total = ''),
        use_pre_freq1_time %>%
    summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                    "PC_emergency_service","PC_others",
                                    "Specialized_visit_physician",
                                    "Specialized_visit_nurse",
                                    "Specialized_visit_unknown_professional"),
                       total_col = TRUE,cont="mean",cont_cut=1,
                       na_include = TRUE,na_to_prop =  FALSE)),
    rbind(data.frame(label = "at home", levels = '',
                     Male = '', Female = '', Total = ''),
        use_pre_freq2_time %>%
    summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                    "PC_emergency_service","PC_others",
                                    "Specialized_visit_physician",
                                    "Specialized_visit_nurse",
                                    "Specialized_visit_unknown_professional"),
                        total_col = TRUE,cont="mean",cont_cut=1,
                        na_include = TRUE,na_to_prop =  FALSE)),
    rbind(data.frame(label = "by telephone or similar", levels = '',
                     Male = '', Female = '', Total = ''),
        use_pre_freq3_time %>%
    summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                    "PC_emergency_service","PC_others",
                                    "Specialized_visit_physician",
                                    "Specialized_visit_nurse",
                                    "Specialized_visit_unknown_professional"),
                        total_col = TRUE,cont="mean",cont_cut=1,
                        na_include = TRUE,na_to_prop =  FALSE)))

kable(use_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 6: Use of Primary Care Services of prevalent patients
label levels Male Female Total
at care center
PC_physician Mean (SD) 5.1 (5.7) 5.6 (6.3) 5.3 (6.0)
PC_nurse Mean (SD) 8.7 (9.3) 8.2 (8.1) 8.5 (8.8)
PC_social_worker Mean (SD) 0.2 (0.7) 0.3 (0.8) 0.3 (0.8)
PC_emergency_service Mean (SD) 1.6 (7.5) 1.8 (7.4) 1.7 (7.5)
PC_others Mean (SD) 0.0 (0.1) 0.0 (0.1) 0.0 (0.1)
Specialized_visit_physician Mean (SD) 0.0 (0.3) 0.0 (0.8) 0.0 (0.6)
Specialized_visit_nurse Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_unknown_professional Mean (SD) 9.0 (17.0) 7.4 (14.2) 8.3 (15.9)
at home
PC_physician Mean (SD) 2.1 (6.4) 2.4 (7.8) 2.3 (7.1)
PC_nurse Mean (SD) 4.4 (12.3) 5.5 (13.3) 5.0 (12.8)
PC_social_worker Mean (SD) 0.1 (0.4) 0.1 (0.4) 0.1 (0.4)
PC_emergency_service Mean (SD) 0.2 (2.7) 0.3 (2.8) 0.3 (2.7)
PC_others Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_physician Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_nurse Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_unknown_professional Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
by telephone or similar
PC_physician Mean (SD) 3.8 (5.8) 4.8 (5.7) 4.3 (5.7)
PC_nurse Mean (SD) 3.1 (5.9) 3.8 (4.8) 3.4 (5.5)
PC_social_worker Mean (SD) 0.3 (1.5) 0.5 (1.4) 0.4 (1.5)
PC_emergency_service Mean (SD) 0.2 (1.0) 0.2 (0.9) 0.2 (0.9)
PC_others Mean (SD) 0.0 (0.1) 0.0 (0.1) 0.0 (0.1)
Specialized_visit_physician Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_nurse Mean (SD) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)
Specialized_visit_unknown_professional Mean (SD) 0.1 (0.3) 0.1 (0.3) 0.1 (0.3)
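The rates summarized in Table 6 are raw visit counts divided by each patient’s follow-up time, as done in the mutate() calls above. As a language-neutral sketch of the same normalization, here is a minimal Python/pandas version (all names and data are hypothetical):

```python
import pandas as pd

# Hypothetical mini event table: one row per primary-care contact.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_type": ["PC_physician", "PC_nurse", "PC_nurse",
                   "PC_physician", "PC_physician"],
})
# Follow-up time per patient (hypothetical, in years).
patients = pd.DataFrame({"patient_id": [1, 2], "follow_up_time": [2.0, 1.0]})

# Cross-tabulate visits per patient and type, then divide each row by the
# patient's follow-up time to obtain visit rates.
counts = pd.crosstab(visits["patient_id"], visits["visit_type"])
rates = counts.div(patients.set_index("patient_id")["follow_up_time"], axis=0)
print(rates)
```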

Patients without predominant clinical condition

Event log’s creation and description

Selecting patients without any predominant clinical condition and applying some preprocessing, we obtain an event log that can be represented as the process map in Figure 12. It shows how events are connected, the mean number of days patients spent in each event (or treatment), the percentage of patients who made each transition, and the mean number of days each transition took. However, the result is a spaghetti map from which nothing can be concluded, so the process map has to be simplified, for example by showing only the most frequent traces covering 20% of the event log, as in Figure 13. Moreover, Figure 14 shows the percentage of patients’ traces in which each activity is present.
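A process map such as the one in Figure 12 is derived from the directly-follows relation of the event log: each consecutive pair of activities in a patient’s trace becomes an edge, annotated with its frequency and mean transition time. A minimal sketch of the edge-counting step (the treatment names are made up):

```python
from collections import Counter

# Hypothetical traces: each is the ordered list of activities for one patient.
traces = [
    ["Metformin", "Metformin+SU", "Triple"],
    ["Metformin", "Metformin+SU"],
    ["Metformin", "Triple"],
]

# Count directly-follows pairs; these counts become the process map's edge
# frequencies (percentages, once divided by the number of patients).
edges = Counter()
for t in traces:
    for a, b in zip(t, t[1:]):
        edges[(a, b)] += 1
print(edges)
```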

Figure 12: Event log’s process maps with all traces

Figure 13: Event log’s process maps with most frequent traces covering 20%

Figure 14: Percentage of patients’ traces in which each activity is present

Clustering traces

Once the set of traces to analyze is selected, a distance measure has to be chosen for clustering. In this example, Levenshtein similarity is used to compute the distance matrix.
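The Levenshtein distance between two traces is the minimum number of activity insertions, deletions and substitutions needed to turn one trace into the other; a similarity can then be obtained by normalizing it, for example by the length of the longer trace. A self-contained sketch operating on sequences of activities rather than characters:

```python
def levenshtein(a, b):
    """Edit distance between two traces (sequences of activities)."""
    # Classic dynamic-programming formulation, row by row.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical treatment traces: one deletion turns the first into the second.
d = levenshtein(["Met", "Met+SU", "Triple"], ["Met", "Triple"])
print(d)  # 1
```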

Once the distance matrix is available, the traces can be clustered. However, the number of clusters has to be fixed beforehand; the following figures help identify a suitable number of clusters.

Figure 15: Distance matrix’s dendrogram

Even when choosing the optimal number of clusters, some very small clusters can appear; we exclude those with fewer than 30 traces to avoid clusters of low representation. The process map and the most frequent traces of one of the remaining clusters are shown below.
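The cut-and-filter step can be sketched as follows, assuming a precomputed trace distance matrix and using SciPy’s hierarchical clustering (the toy data and the minimum size of 3, instead of the 30 used above, are illustrative):

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for 6 traces (e.g. Levenshtein-based):
# two well-separated groups of three traces each.
D = np.array([[0, 1, 1, 8, 8, 9],
              [1, 0, 2, 8, 9, 9],
              [1, 2, 0, 9, 8, 8],
              [8, 8, 9, 0, 1, 1],
              [8, 9, 8, 1, 0, 2],
              [9, 9, 8, 1, 2, 0]], dtype=float)

# Average-linkage hierarchical clustering on the condensed distances;
# the dendrogram of Z is what Figure 15 displays.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

# Keep only clusters with at least min_size traces (30 in the pipeline).
min_size = 3
sizes = Counter(labels)
kept = [c for c, n in sizes.items() if n >= min_size]
print(sorted(sizes.values()), kept)
```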

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 16: Cluster 0

Conformance checking

Once traces are clustered, a boxplot makes it easy to show that each cluster behaves differently with respect to the treatment guidelines. Comparing traces with Figure 6, computing the fitness of each trace (1 being a perfect match with the Petri net and 0 the lowest possible fitness) and grouping by cluster results in Figure 17.

Figure 17: Traces fitness distribution by cluster
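The fitness in Figure 17 is computed by replaying each trace on the Petri net. A common token-based definition (Rozinat and van der Aalst) averages the fraction of tokens that had to be created artificially (missing) and those left over after replay (remaining); a minimal sketch:

```python
def replay_fitness(missing, consumed, remaining, produced):
    """Token-based replay fitness: 1 = the trace replays perfectly on the
    Petri net; values near 0 = very poor fit."""
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)

# A perfectly replaying trace needs no missing tokens and leaves none over...
print(replay_fitness(0, 10, 0, 10))  # 1.0
# ...while missing and leftover tokens lower the fitness.
print(replay_fitness(2, 10, 1, 10))  # 0.85
```

Grouping these per-trace values by cluster gives the boxplots of Figure 17.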

Decision mining

Figure 18 shows how patients’ age, sex and copayment influence the prescription of a triple-drug treatment. More features could have been added, but for simplicity we included only these three.

Figure 18: Decision tree of place9 (triple treatment prescription step)

The relevance of the variables in the previous decision tree is shown in Figure 19. Categorical variables are split into as many parts as they have categories, and a variable’s total relevance is the sum of its categories’ values.

Figure 19: Features importance in place9 (triple treatment prescription step)
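The decision-mining step fits a classification tree at a decision point on patient attributes. A sketch with scikit-learn on synthetic data, illustrating the per-category importances described above (the data-generating rule and all values are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
age = rng.integers(40, 90, n)
sex = rng.integers(0, 2, n)        # categorical, will be one-hot encoded
copay = rng.integers(0, 3, n)      # 3 hypothetical copayment categories
# Hypothetical rule: triple therapy more likely for younger patients.
y = (age < 65).astype(int)

# One-hot encode categoricals before fitting: each category becomes its own
# column, so importances are reported per category, as in Figure 19.
X = np.column_stack([age,
                     np.eye(2)[sex],     # sex_0, sex_1
                     np.eye(3)[copay]])  # copay_0 .. copay_2
names = ["age", "sex_0", "sex_1", "copay_0", "copay_1", "copay_2"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
imp = dict(zip(names, tree.feature_importances_))
# A variable's total relevance is the sum over its category columns.
print(imp["age"], imp["sex_0"] + imp["sex_1"])
```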

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 18857, number of events= 990 

             coef exp(coef)  se(coef)      z Pr(>|z|)   
age     -0.008555  0.991482  0.003088 -2.770   0.0056 **
sex1     0.055093  1.056639  0.065477  0.841   0.4001   
fitness -0.405129  0.666891  0.190500 -2.127   0.0334 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9915     1.0086    0.9855    0.9975
sex1       1.0566     0.9464    0.9294    1.2013
fitness    0.6669     1.4995    0.4591    0.9687

Concordance= 0.531  (se = 0.01 )
Likelihood ratio test= 14.24  on 3 df,   p=0.003
Wald test            = 14.59  on 3 df,   p=0.002
Score (logrank) test = 14.6  on 3 df,   p=0.002
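As a reading aid for the summary above: exp(coef) is the hazard ratio obtained by exponentiating the coefficient, so the fitness estimate can be checked directly:

```python
import math

# exp(coef) for fitness in the Cox summary above: the hazard ratio.
coef_fitness = -0.405129
hr = math.exp(coef_fitness)
print(round(hr, 4))  # 0.6669 -> a one-unit increase in fitness is associated
                     # with roughly a 33% lower hospitalization hazard
```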

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 2854 
     Number of observations: 18857 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 990
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 4.8e-18 
                         : likelihood= 2.2e-11 
                         : second derivatives= 2e-08 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 17669.74  
     AIC: -35311.49  
     BIC: -35228.1  
     Score test statistic for CI assumption: 1.275 (p-value=0.5285) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1 -1.59186 0.07111 -22.387 0.00000

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.02946 0.00286  10.318 0.00000
event1 +/-sqrt(Weibull2)  0.94573 0.01249  75.710 0.00000
event1 SurvPH class1      0.47427 0.10197   4.651 0.00000
age                      -0.00829 0.00308  -2.689 0.00717
sex1                      0.04925 0.06544   0.753 0.45168

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.76526 0.01004  76.193 0.00000
intercept class2  0.91381 0.00338 270.658 0.00000
t_0 class1       -0.00043 0.00001 -32.415 0.00000
t_0 class2       -0.00003 0.00000  -7.818 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.01839    
t_0        -0.00001   0

                            coef      Se
Residual standard error  0.04078 0.00024

 
Posterior classification based on longitudinal and time-to-event data: 
  class1  class2
N 405.00 2449.00
%  14.19   85.81
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.8556 0.1444
class2 0.0556 0.9444
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  79.26  94.69
prob>0.8  66.91  90.20
prob>0.9  55.06  83.26
 
 
Posterior classification based only on longitudinal data: 
  class1  class2
N 386.00 2468.00
%  13.52   86.48
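The class-membership intercept reported above parameterizes a logit with the last class as reference, so the model’s prior probability of class 1 follows directly (compare it with the posterior class sizes of about 14% and 86%):

```python
import math

# Multinomial logit with the last class as the reference class:
# P(class1) = exp(intercept1) / (1 + exp(intercept1))
intercept_class1 = -1.59186
p1 = math.exp(intercept_class1) / (1 + math.exp(intercept_class1))
print(round(p1, 3))  # 0.169 -> prior probability of belonging to class 1
```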
 

Figure 20: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 21: Class-specific mean predicted fitness trajectory

Figure 22: Class-specific survival functions

CV Disease

Event log’s creation and description

Selecting patients with a cardiovascular disease and applying the same preprocessing, we obtain a new event log on which we repeat the analysis carried out for patients without any predominant clinical condition.

Figure 23: Event log’s process maps with all traces

Figure 24: Event log’s process maps with most frequent traces covering 20%

Figure 25: Percentage of patients’ traces in which each activity is present

Clustering traces

Figure 26: Distance matrix’s dendrogram

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 27: Cluster 0

Conformance checking

Figure 28: Traces fitness distribution by cluster

Decision mining

Figure 29: Decision tree of place9 (triple treatment prescription step)

Figure 30: Features importance in place9 (triple treatment prescription step)

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 2854, number of events= 261 

             coef exp(coef)  se(coef)      z Pr(>|z|)
age     -0.000683  0.999317  0.005923 -0.115    0.908
sex1    -0.006853  0.993170  0.138269 -0.050    0.960
fitness -0.080073  0.923049  0.467042 -0.171    0.864

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9993      1.001    0.9878     1.011
sex1       0.9932      1.007    0.7574     1.302
fitness    0.9230      1.083    0.3696     2.306

Concordance= 0.513  (se = 0.02 )
Likelihood ratio test= 0.05  on 3 df,   p=1
Wald test            = 0.05  on 3 df,   p=1
Score (logrank) test = 0.05  on 3 df,   p=1

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 462 
     Number of observations: 2854 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 261
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 8.3e-11 
                         : likelihood= 2.3e-05 
                         : second derivatives= 3.2e-06 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 1586.47  
     AIC: -3144.94  
     BIC: -3087.04  
     Score test statistic for CI assumption: 4.88 (p-value=0.0872) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1 -0.63657 0.39696  -1.604 0.10880

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.02924 0.00618   4.733 0.00000
event1 +/-sqrt(Weibull2)  1.01006 0.03934  25.672 0.00000
event1 SurvPH class1      1.62751 0.34931   4.659 0.00000
age                      -0.00401 0.00692  -0.579 0.56288
sex1                      0.06002 0.16440   0.365 0.71505

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.60431 0.01214  49.795 0.00000
intercept class2  0.58151 0.00866  67.156 0.00000
t_0 class1       -0.00052 0.00007  -7.196 0.00000
t_0 class2       -0.00022 0.00002  -9.623 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.01213    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.04784 0.00075

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  164.0  298.0
%   35.5   64.5
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.6999 0.3001
class2 0.1512 0.8488
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  45.12  76.17
prob>0.8  15.85  63.09
prob>0.9   3.66  52.68
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N  87.00 375.00
%  18.83  81.17
 

Figure 31: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 32: Class-specific mean predicted fitness trajectory

Figure 33: Class-specific survival functions

Heart Failure

Event log’s creation and description

Selecting patients with heart failure and applying the same preprocessing, we obtain a new event log on which we repeat the analysis carried out for patients without any predominant clinical condition.

Figure 34: Event log’s process maps with all traces

Figure 35: Event log’s process maps with most frequent traces covering 20%

Figure 36: Percentage of patients’ traces in which each activity is present

Clustering traces

Figure 37: Distance matrix’s dendrogram

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 38: Cluster 0

Conformance checking

Figure 39: Traces fitness distribution by cluster

Decision mining

Figure 40: Decision tree of place9 (triple treatment prescription step)

Figure 41: Features importance in place9 (triple treatment prescription step)

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 1166, number of events= 126 

             coef exp(coef)  se(coef)      z Pr(>|z|)  
age      0.018126  1.018291  0.008138  2.227   0.0259 *
sex1    -0.227212  0.796752  0.185171 -1.227   0.2198  
fitness -1.035612  0.355009  0.890678 -1.163   0.2449  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0183      0.982   1.00218     1.035
sex1       0.7968      1.255   0.55425     1.145
fitness    0.3550      2.817   0.06196     2.034

Concordance= 0.568  (se = 0.029 )
Likelihood ratio test= 6.71  on 3 df,   p=0.08
Wald test            = 6.46  on 3 df,   p=0.09
Score (logrank) test = 6.47  on 3 df,   p=0.09

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 210 
     Number of observations: 1166 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 126
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 1e-18 
                         : likelihood= 2.3e-13 
                         : second derivatives= 1.2e-11 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 530.95  
     AIC: -1033.89  
     BIC: -987.04  
     Score test statistic for CI assumption: 6.543 (p-value=0.0379) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1 -0.19870 0.28644  -0.694 0.48786

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.01661 0.00568   2.924 0.00346
event1 +/-sqrt(Weibull2)  1.00354 0.04370  22.965 0.00000
event1 SurvPH class1      1.65383 0.30592   5.406 0.00000
age                       0.01181 0.00907   1.302 0.19287
sex1                      0.05210 0.21029   0.248 0.80432

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.62505 0.01381  45.268 0.00000
intercept class2  0.57986 0.01318  43.986 0.00000
t_0 class1       -0.00064 0.00006 -10.823 0.00000
t_0 class2       -0.00025 0.00002 -10.255 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept    0.0104    
t_0          0.0000   0

                            coef      Se
Residual standard error  0.04711 0.00119

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  101.0  109.0
%   48.1   51.9
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.8045 0.1955
class2 0.1224 0.8776
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  85.15  78.90
prob>0.8  52.48  73.39
prob>0.9  20.79  62.39
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N  81.00 129.00
%  38.57  61.43
 

Figure 42: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 43: Class-specific mean predicted fitness trajectory

Figure 44: Class-specific survival functions

Chronic kidney disease

Event log’s creation and description

Selecting patients with chronic kidney disease and applying the same preprocessing, we obtain a new event log on which we repeat the analysis carried out for patients without any predominant clinical condition.

Figure 45: Event log’s process maps with all traces

Figure 46: Event log’s process maps with most frequent traces covering 20%

Figure 47: Percentage of patients’ traces in which each activity is present

Clustering traces

Figure 48: Distance matrix’s dendrogram

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 49: Cluster 0

Conformance checking

Figure 50: Traces fitness distribution by cluster

Decision mining

Figure 51: Decision tree of place9 (triple treatment prescription step)

Figure 52: Features importance in place9 (triple treatment prescription step)

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 1131, number of events= 118 

             coef exp(coef)  se(coef)      z Pr(>|z|)
age      0.005870  1.005887  0.009487  0.619    0.536
sex1     0.076895  1.079928  0.197359  0.390    0.697
fitness -0.153373  0.857810  0.804843 -0.191    0.849

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0059     0.9941    0.9874     1.025
sex1       1.0799     0.9260    0.7335     1.590
fitness    0.8578     1.1658    0.1771     4.154

Concordance= 0.513  (se = 0.033 )
Likelihood ratio test= 0.79  on 3 df,   p=0.9
Wald test            = 0.78  on 3 df,   p=0.9
Score (logrank) test = 0.78  on 3 df,   p=0.9

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 242 
     Number of observations: 1131 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 118
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 4.6e-11 
                         : likelihood= 5.6e-07 
                         : second derivatives= 8.1e-08 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 549.78  
     AIC: -1071.56  
     BIC: -1022.72  
     Score test statistic for CI assumption: 0.083 (p-value=0.9593) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se   Wald p-value
intercept class1 -0.93966 0.51917 -1.810 0.07031

Parameters in the proportional hazard model:

                             coef      Se   Wald p-value
event1 +/-sqrt(Weibull1)  0.02621 0.00943  2.781 0.00542
event1 +/-sqrt(Weibull2)  1.01488 0.04622 21.958 0.00000
event1 SurvPH class1      1.33861 0.36856  3.632 0.00028
age                       0.00432 0.01022  0.423 0.67230
sex1                      0.03075 0.20970  0.147 0.88341

Fixed effects in the longitudinal model:

                     coef      Se   Wald p-value
intercept class1  0.60418 0.01885 32.052 0.00000
intercept class2  0.58567 0.01013 57.818 0.00000
t_0 class1       -0.00064 0.00010 -6.221 0.00000
t_0 class2       -0.00025 0.00003 -8.815 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.01152    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.04398 0.00117

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  55.00 187.00
%  22.73  77.27
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.6624 0.3376
class2 0.1688 0.8312
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  34.55  79.68
prob>0.8  16.36  61.50
prob>0.9   7.27  42.78
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N  33.00 209.00
%  13.64  86.36
 

Figure 53: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 54: Class-specific mean predicted fitness trajectory

Figure 55: Class-specific survival functions

Frailty

Event log’s creation and description

Selecting patients with frailty and applying the same preprocessing, we obtain a new event log on which we repeat the analysis carried out for patients without any predominant clinical condition.

Figure 56: Event log’s process maps with all traces

Figure 57: Event log’s process maps with most frequent traces covering 20%

Figure 58: Percentage of patients’ traces in which each activity is present

Clustering traces

Figure 59: Distance matrix’s dendrogram

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 60: Cluster 0

Conformance checking

Figure 61: Traces fitness distribution by cluster

Decision mining

Figure 62: Decision tree of place9 (triple treatment prescription step)

Figure 63: Features importance in place9 (triple treatment prescription step)

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 6535, number of events= 416 

             coef exp(coef)  se(coef)      z Pr(>|z|)  
age     -0.024846  0.975460  0.010359 -2.398   0.0165 *
sex1    -0.005628  0.994388  0.101149 -0.056   0.9556  
fitness  0.392983  1.481394  0.229867  1.710   0.0873 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9755      1.025    0.9559    0.9955
sex1       0.9944      1.006    0.8156    1.2124
fitness    1.4814      0.675    0.9441    2.3245

Concordance= 0.547  (se = 0.015 )
Likelihood ratio test= 9.86  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 9.69  on 3 df,   p=0.02

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 1080 
     Number of observations: 6535 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 416
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  8 
     Convergence criteria: parameters= 2.2e-07 
                         : likelihood= 9.6e-06 
                         : second derivatives= 1.9e-09 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 4983.84  
     AIC: -9939.69  
     BIC: -9869.9  
     Score test statistic for CI assumption: 12.927 (p-value=0.0016) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1 -0.80123 0.06695 -11.967 0.00000

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.07159 0.03099   2.310 0.02088
event1 +/-sqrt(Weibull2)  0.98092 0.01954  50.212 0.00000
event1 SurvPH class1      0.24610 0.10677   2.305 0.02116
age                      -0.02436 0.01067  -2.283 0.02244
sex1                     -0.00839 0.11212  -0.075 0.94035

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.97253 0.00443 219.532 0.00000
intercept class2  0.62371 0.00289 216.048 0.00000
t_0 class1       -0.00007 0.00002  -3.831 0.00013
t_0 class2       -0.00021 0.00001 -18.567 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.00477    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.04363 0.00045

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  337.0  743.0
%   31.2   68.8
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.9833 0.0167
class2 0.0043 0.9957
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  97.63  99.73
prob>0.8  97.33  99.46
prob>0.9  97.03  98.79
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N 336.00 744.00
%  31.11  68.89
 

Figure 64: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 65: Class-specific mean predicted fitness trajectory

Figure 66: Class-specific survival functions

Obesity

Event log’s creation and description

Selecting patients with obesity and applying the same preprocessing, we obtain a new event log on which we repeat the analysis carried out for patients without any predominant clinical condition.

Figure 67: Event log’s process maps with all traces

Figure 68: Event log’s process maps with most frequent traces covering 20%

Figure 69: Percentage of patients’ traces in which each activity is present

Clustering traces

Distance matrix’s dendrogram

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 70: Cluster 0

Conformance checking

Figure 71: Traces fitness distribution by cluster

Decision mining

Figure 72: Decision tree of place9 (triple treatment prescription step)

Figure 73: Features importance in place9 (triple treatment prescription step)

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 12121, number of events= 717 

             coef exp(coef)  se(coef)      z Pr(>|z|)    
age     -0.011723  0.988345  0.003468 -3.381 0.000723 ***
sex1     0.064597  1.066729  0.075190  0.859 0.390277    
fitness -0.132166  0.876195  0.172409 -0.767 0.443327    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9883     1.0118    0.9817    0.9951
sex1       1.0667     0.9374    0.9206    1.2361
fitness    0.8762     1.1413    0.6250    1.2284

Concordance= 0.534  (se = 0.013 )
Likelihood ratio test= 12.47  on 3 df,   p=0.006
Wald test            = 12.77  on 3 df,   p=0.005
Score (logrank) test = 12.78  on 3 df,   p=0.005

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 1956 
     Number of observations: 12121 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 717
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  6 
     Convergence criteria: parameters= 9.7e-08 
                         : likelihood= 2e-05 
                         : second derivatives= 9.9e-09 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 8370.54  
     AIC: -16713.09  
     BIC: -16634.98  
     Score test statistic for CI assumption: 15.455 (p-value=4e-04) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1  0.43719 0.04950   8.832 0.00000

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.03611 0.00381   9.480 0.00000
event1 +/-sqrt(Weibull2)  0.95903 0.01477  64.925 0.00000
event1 SurvPH class1      0.00986 0.08130   0.121 0.90344
age                      -0.01204 0.00348  -3.462 0.00054
sex1                      0.07040 0.07527   0.935 0.34964

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.61656 0.00295 209.090 0.00000
intercept class2  0.96074 0.00384 250.358 0.00000
t_0 class1       -0.00016 0.00001 -16.278 0.00000
t_0 class2       -0.00017 0.00001 -13.119 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.00724    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.04800 0.00036

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N 1201.0  755.0
%   61.4   38.6
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.9736 0.0264
class2 0.0254 0.9746
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  97.67  97.75
prob>0.8  96.34  97.48
prob>0.9  89.76  91.39
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N 1201.0  755.0
%   61.4   38.6
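The posterior classification above assigns each subject to its modal class, i.e. the class with the highest posterior membership probability. A toy sketch with hypothetical probabilities:

```python
# Toy posterior probabilities for three subjects (hypothetical values).
posteriors = {
    "A": [0.97, 0.03],
    "B": [0.20, 0.80],
    "C": [0.55, 0.45],
}

# Modal classification: each subject goes to its highest-probability class.
assignment = {s: p.index(max(p)) + 1 for s, p in posteriors.items()}
```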
 

Figure 74: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 75: Class-specific mean predicted fitness trajectory

Figure 76: Class-specific survival functions

Process indicators’ analysis

We selected incident patients with at least one year of follow-up to create a process-indicators event log. This log includes cholesterol, albumin-to-creatinine ratio, glycated hemoglobin, body mass index, blood pressure and glomerular filtration measurements, and foot and eye examinations.

An activity-presence analysis shows the percentage of the analyzed traces in which each indicator appears.
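The presence percentage of an activity is the share of traces containing it at least once. A minimal sketch with made-up traces:

```python
# Made-up traces: each patient's list of process-indicator activities.
traces = [
    ["HbA1c", "cholesterol", "foot_exam"],
    ["HbA1c", "blood_pressure"],
    ["HbA1c", "cholesterol", "HbA1c"],  # repeats count once per trace
]

# Percentage of traces in which each activity appears at least once.
activities = {a for t in traces for a in t}
presence = {a: 100 * sum(a in t for t in traces) / len(traces) for a in activities}
```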

Figure 77: Percentage of patients’ traces an activity is present during the first year

Figure 78: Percentage of patients’ traces an activity is present during the second year

Figure 79: Percentage of patients’ traces an activity is present during the third year

Figure 80: Percentage of patients’ traces an activity is present during the fourth year

Figure 81: Percentage of patients’ traces an activity is present during the fifth year

Conformance checking

As before, in this section the observed traces are compared with a specific theoretical process. However, instead of a single Petri net, five Petri nets are employed, one per year, since adherence is now assessed in annual intervals from the diabetes detection date.
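Mapping each event to its annual interval from the diagnosis date can be sketched as below (a day-level approximation with hypothetical dates; leap years and exact anniversaries are ignored here):

```python
from datetime import date

def annual_window(diagnosis: date, event: date) -> int:
    """1-based follow-up year an event falls in (1 = first year after diagnosis)."""
    # Whole 365-day years elapsed since diagnosis; this simplification is
    # sufficient to route each event to the Petri net of the right year.
    return (event - diagnosis).days // 365 + 1

diag = date(2017, 1, 1)  # hypothetical diabetes detection date
```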

Boxplot of process indicators’ traces’ fitness by cumulative year

Treatments’ fitness and process indicators’ fitness are linked in the following plot:

Boxplot of process indicators’ traces’ fitness by cumulative year and by predominant clinical condition

Prediction Models

In the next example, we use fitness, age and sex to try to predict hospitalization with a time-dependent Cox model; its summary is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 13661, number of events= 2503 

            coef exp(coef) se(coef)     z Pr(>|z|)    
age     0.013055  1.013140 0.001561 8.362  < 2e-16 ***
sex1    0.024150  1.024444 0.040948 0.590 0.555343    
fitness 0.255482  1.291083 0.075443 3.386 0.000708 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.013     0.9870    1.0100     1.016
sex1        1.024     0.9761    0.9454     1.110
fitness     1.291     0.7745    1.1136     1.497

Concordance= 0.55  (se = 0.006 )
Likelihood ratio test= 87.23  on 3 df,   p=<2e-16
Wald test            = 86.42  on 3 df,   p=<2e-16
Score (logrank) test = 86.5  on 3 df,   p=<2e-16

Using the same predictive variables, a joint latent class model with two latent classes is fitted:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 6144 
     Number of observations: 13661 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 2503
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 1.2e-11 
                         : likelihood= 7.3e-05 
                         : second derivatives= 6.9e-06 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: -15971.76  
     AIC: 31971.52  
     BIC: 32065.65  
     Score test statistic for CI assumption: 1.365 (p-value=0.5053) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1  1.45062 0.03838  37.794 0.00000

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.01496 0.00090  16.654 0.00000
event1 +/-sqrt(Weibull2)  0.95739 0.00809 118.361 0.00000
event1 SurvPH class1      0.17184 0.05716   3.006 0.00264
age                       0.01308 0.00156   8.362 0.00000
sex1                      0.02767 0.04080   0.678 0.49756

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.79629 0.00265 300.629 0.00000
intercept class2  0.18936 0.00647  29.253 0.00000
t_0 class1       -0.00006 0.00000 -13.953 0.00000
t_0 class2        0.00026 0.00001  26.066 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept    0.0199    
t_0          0.0000   0

                            coef      Se
Residual standard error  0.08374 0.00097

 
Posterior classification based on longitudinal and time-to-event data: 
   class1  class2
N 4994.00 1150.00
%   81.28   18.72
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.9798 0.0202
class2 0.0733 0.9267
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  98.02  89.22
prob>0.8  97.20  84.09
prob>0.9  94.79  78.70
 
 
Posterior classification based only on longitudinal data: 
   class1  class2
N 4990.00 1154.00
%   81.22   18.78
 

Figure 82: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 83: Class-specific mean predicted fitness trajectory

Figure 84: Class-specific survival functions

Joint latent class models’ summary

(a) ‘else’ patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(b) ‘else’ patients’ class-specific mean predicted treatments’ fitness trajectory

(c) ‘else’ patients’ class-specific survival functions

(d) cvd patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(e) cvd patients’ class-specific mean predicted treatments’ fitness trajectory

(f) cvd patients’ class-specific survival functions

(g) hf patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(h) hf patients’ class-specific mean predicted treatments’ fitness trajectory

(i) hf patients’ class-specific survival functions

(j) ckd patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(k) ckd patients’ class-specific mean predicted treatments’ fitness trajectory

(l) ckd patients’ class-specific survival functions

(m) Fragile patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(n) Fragile patients’ class-specific mean predicted treatments’ fitness trajectory

(o) Fragile patients’ class-specific survival functions

(p) Obese patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(q) Obese patients’ class-specific mean predicted treatments’ fitness trajectory

(r) Obese patients’ class-specific survival functions

(s) Class-specific weighted marginal and subject-specific mean predicted process indicators’ fitness trajectories

(t) Class-specific mean predicted process indicators’ fitness trajectory

(u) Class-specific survival functions

Figure 85: Joint latent class models’ plots’ panel